This family of string functions was inspired by Python's str.split
,
str.rsplit
, and str.splitlines
methods. These string splitting
methods are well-designed and often do what you want. But they're
surprisingly narrow and opinionated. What if your use case doesn't
map neatly to one of these functions? str.split
supports two
very specific modes of operation--unless you want to split your
string in exactly one of those two modes, you probably can't use
str.split
to solve your problem.
So what can you use? There's
re.split
,
but that can be hard to use.[1]
Regular expressions can be so hard to get right, and the
semantics of re.split
are subtly different from the usual
string splitting functions. Not to mention, it doesn't support
reverse!
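To make that concrete, here's a small sketch of splitting on two literal separators with re.split, using only the standard library. It isn't part of big, and the variable names are made up for illustration. Notice that every separator has to be escaped first, because characters like ^ mean something special in a regular expression.

```python
import re

text = 'apple^banana_cookie'
separators = ('^', '_')

# Naive attempt: here '^' is a regex anchor, not a literal caret,
# so the string is not split at the caret character.
naive = re.split('^|_', text)

# Escaping each separator makes the pattern match them literally.
pattern = '|'.join(re.escape(sep) for sep in separators)
escaped = re.split(pattern, text)
print(escaped)  # ['apple', 'banana', 'cookie']
```

And there's still no re.rsplit to reach for if you want to start from the right.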
Now there's a new answer:
multisplit
.
The goal of multisplit
is to be the be-all end-all string splitting function.
It's designed to supersede every mode of operation provided by
str.split
, str.rsplit
, and str.splitlines
, and it
can even replace str.partition
and str.rpartition
too.
multisplit
does it all!
The downside of multisplit
's
awesome flexibility is that it can be hard to use... after all,
it takes five keyword-only parameters. However, these parameters
and their defaults are designed to be easy to remember.
The best way to cope with
multisplit
's
complexity is to use it as a building block for your own
text splitting functions. For example, big uses
multisplit
to implement
multipartition
,
normalize_whitespace
,
lines
,
and several other functions.
Using multisplit
To use
multisplit
,
pass in the string you want to split, the separators you
want to split on, and tweak its behavior with its five
keyword arguments. It returns an iterator that yields
string segments from the original string in your preferred
format. The separator list is optional; if you don't
pass one in, it defaults to an iterable of whitespace separators
(either
big.whitespace
or
big.ascii_whitespace
,
as appropriate).
The cornerstone of multisplit
is the separators
argument.
This is an iterable of strings, of the same type (str
or bytes
)
as the string you want to split (s
). multisplit
will split
the string at each non-overlapping instance of any string
specified in separators
.
multisplit
lets you fine-tune its behavior via five keyword-only
parameters:
keep
lets you include the separator strings in the output,
in a number of different formats.
separate
lets you specify whether adjacent separator strings
should be grouped together (like str.split
operating on
whitespace) or regarded as separate (like str.split
when
you pass in an explicit separator).
strip
lets you strip separator strings from the beginning,
end, or both ends of the string you're splitting. It also
supports a special progressive mode that duplicates the
behavior of str.split
when you use None
as the separator.
maxsplit
lets you specify the maximum number of times to
split the string, exactly like the maxsplit
argument to str.split
.
reverse
makes multisplit
behave like str.rsplit
,
starting at the end of the string and working backwards.
(This only changes the behavior of multisplit
if you use
maxsplit
, or if your string contains overlapping separators.)
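Several of these parameters are described in terms of what str.split already does. As a reminder, here's a quick sketch of those built-in behaviors, using only plain Python string methods (no big required). The sample strings are just for illustration.

```python
text = '  apple  banana  '

# No separator: runs of whitespace are grouped together,
# and both ends of the string are stripped.
grouped = text.split()
print(grouped)    # ['apple', 'banana']

# Explicit separator: every single space splits the string,
# so empty strings show up between adjacent separators.
separated = text.split(' ')
print(separated)  # ['', '', 'apple', '', 'banana', '', '']

# maxsplit counts splits from the left for split,
# and from the right for rsplit.
from_left = 'a b c'.split(' ', 1)
from_right = 'a b c'.rsplit(' ', 1)
print(from_left)   # ['a', 'b c']
print(from_right)  # ['a b', 'c']
```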
To make it slightly easier to remember, all these keyword-only
parameters default to a false value. (Well, technically,
maxsplit
defaults to the special value -1
, for compatibility
with str.split
. But that's its special "don't do anything"
magic value. All the other keyword-only parameters default
to False
.)
multisplit
also inspired multistrip
and multipartition
,
which also take this same separators
argument. (There are also
other big functions that take a separators
argument,
for example comment_markers
for
lines_filter_line_comment_lines
.)
Demonstrations of each multisplit
keyword-only parameter
To give you a sense of how the five keyword-only parameters change the behavior of
multisplit
,
here's a breakdown of each of these parameters with examples.
maxsplit
-
maxsplit
specifies the maximum number of times the string should be split.
It behaves the same as the maxsplit
parameter to str.split
.
The default value of -1
means "split as many times as you can". In our
example here, the string can be split a maximum of two times. Therefore,
specifying a maxsplit
of -1
is equivalent to specifying a maxsplit
of
2
or greater:
>>> list(big.multisplit('apple^banana_cookie', ('_', '^')))
['apple', 'banana', 'cookie']
>>> list(big.multisplit('apple^banana_cookie', ('_', '^'), maxsplit=0))
['apple^banana_cookie']
>>> list(big.multisplit('apple^banana_cookie', ('_', '^'), maxsplit=1))
['apple', 'banana_cookie']
>>> list(big.multisplit('apple^banana_cookie', ('_', '^'), maxsplit=2))
['apple', 'banana', 'cookie']
>>> list(big.multisplit('apple^banana_cookie', ('_', '^'), maxsplit=3))
['apple', 'banana', 'cookie']
maxsplit
has interactions with reverse
and strip
. For more
information, see the documentation regarding those parameters below.
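For comparison, here's a rough sketch of why "the same as str.split" is worth calling out. re.split also has a maxsplit parameter, but there 0 means "no limit", whereas str.split uses -1 for "no limit" and 0 for "don't split at all". The helper below, split_any, is a hypothetical name invented for this example; it is not how multisplit is written, just a standard-library illustration of translating between the two conventions.

```python
import re

def split_any(s, separators, maxsplit=-1):
    """Hypothetical helper: split s on any literal separator,
    using str.split-style maxsplit semantics."""
    if maxsplit == 0:
        # str.split semantics: zero splits means return the string whole.
        return [s]
    pattern = '|'.join(re.escape(sep) for sep in separators)
    # re.split uses maxsplit=0 for "no limit"; str.split uses -1.
    return re.split(pattern, s, maxsplit=max(maxsplit, 0))

print(split_any('apple^banana_cookie', ('_', '^')))              # ['apple', 'banana', 'cookie']
print(split_any('apple^banana_cookie', ('_', '^'), maxsplit=0))  # ['apple^banana_cookie']
print(split_any('apple^banana_cookie', ('_', '^'), maxsplit=1))  # ['apple', 'banana_cookie']
```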
keep
-
keep
indicates whether or not multisplit
should preserve the separator
strings in the strings it yields. It supports four values: false, true,
and the special values ALTERNATING
and AS_PAIRS
.
When keep
is false, multisplit
throws away the separator strings;
they won't appear in the output.
>>> list(big.multisplit('apple#banana-cookie', ('#', '-')))
['apple', 'banana', 'cookie']
>>> list(big.multisplit('apple-banana#cookie', ('#', '-'), keep=False))
['apple', 'banana', 'cookie']
When keep
is true, multisplit
keeps the separators, appending them to
the end of the separated string:
>>> list(big.multisplit('apple$banana~cookie', ('$', '~'), keep=True))
['apple$', 'banana~', 'cookie']
When keep
is ALTERNATING
, multisplit
keeps the separators as separate
strings. The first string yielded is always a non-separator string, and
from then on it always alternates between a separator string and a non-separator
string. Put another way, if you store the output of multisplit
in a list,
entries with an even-numbered index (0, 2, 4, ...) are always non-separator strings,
and entries with an odd-numbered index (1, 3, 5, ...) are always separator strings.
>>> list(big.multisplit('appleXbananaYcookie', ('X', 'Y'), keep=big.ALTERNATING))
['apple', 'X', 'banana', 'Y', 'cookie']
Note that ALTERNATING
always emits an odd number of strings, and the first and
last strings yielded are always non-separator strings. Like str.split
,
if the string you're splitting starts or ends with a separator string,
multisplit
will emit an empty string at the beginning or end, to preserve
the "always begin and end with non-separator string" invariant:
>>> list(big.multisplit('1a1z1', ('1',), keep=big.ALTERNATING))
['', '1', 'a', '1', 'z', '1', '']
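If you've used re.split with a capturing group, this shape may look familiar. Here's a standard-library sketch (not big itself) that produces the same kind of alternating list for this one simple case, and shows how slicing pulls the two kinds of strings apart.

```python
import re

# A capturing group makes re.split keep the separators in its output.
alternating = re.split('(1)', '1a1z1')
print(alternating)  # ['', '1', 'a', '1', 'z', '1', '']

# Even indexes hold non-separator strings, odd indexes hold separators.
non_separators = alternating[0::2]
separators = alternating[1::2]
print(non_separators)  # ['', 'a', 'z', '']
print(separators)      # ['1', '1', '1']
```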
Finally, when keep
is AS_PAIRS
, multisplit
keeps the separators as separate
strings. But it doesn't yield bare strings; instead, it yields 2-tuples of strings.
Every 2-tuple contains a non-separator string followed by a separator string.
If the original string starts with a separator, the first 2-tuple will contain
an empty non-separator string and the separator:
>>> list(big.multisplit('^apple-banana^cookie', ('-', '^'), keep=big.AS_PAIRS))
[('', '^'), ('apple', '-'), ('banana', '^'), ('cookie', '')]
The last 2-tuple will always contain an empty separator string:
>>> list(big.multisplit('apple*banana+cookie', ('*', '+'), keep=big.AS_PAIRS))
[('apple', '*'), ('banana', '+'), ('cookie', '')]
>>> list(big.multisplit('apple*banana+cookie***', ('*', '+'), keep=big.AS_PAIRS, strip=True))
[('apple', '*'), ('banana', '+'), ('cookie', '')]
(This rule means that AS_PAIRS
always emits an even number of strings.
Contrast that with ALTERNATING
, which always emits an odd number of strings,
and the last string it emits is always a non-separator string. Put another
way: if you ignore the tuples, the list of strings emitted by AS_PAIRS
is the
same as those emitted by ALTERNATING
, except AS_PAIRS
appends an empty
string.)
Because of this rule, if the original string ends with a separator,
and multisplit
doesn't strip
the right side, the final tuple
emitted by AS_PAIRS
will be a 2-tuple containing two empty strings:
>>> list(big.multisplit('appleXbananaYcookieX', ('X', 'Y'), keep=big.AS_PAIRS))
[('apple', 'X'), ('banana', 'Y'), ('cookie', 'X'), ('', '')]
This looks strange and unnecessary. But it is what you want.
This odd-looking behavior is discussed at length in the section below, titled
Why do you sometimes get empty strings when you split?
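The relationship between the two formats described above can be sketched in a few lines of plain Python. The helper name alternating_to_pairs is made up for this example and isn't a big function; it only assumes its input has the ALTERNATING shape (an odd number of strings, starting and ending with a non-separator).

```python
def alternating_to_pairs(alternating):
    """Hypothetical helper: pair each non-separator string with the
    separator that follows it, padding the end with an empty separator."""
    padded = list(alternating) + ['']
    return list(zip(padded[0::2], padded[1::2]))

print(alternating_to_pairs(['apple', 'X', 'banana', 'Y', 'cookie']))
# [('apple', 'X'), ('banana', 'Y'), ('cookie', '')]

# When the input ends with a separator, the ALTERNATING list ends with '',
# so the final pair naturally comes out as ('', '').
print(alternating_to_pairs(['', '1', 'a', '1', 'z', '1', '']))
# [('', '1'), ('a', '1'), ('z', '1'), ('', '')]
```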
The behavior of keep
can be affected by the value of separate
.
For more information, see the next section, on separate
.
separate
-
separate
indicates whether multisplit should consider adjacent
separator strings in s
as one separator or as multiple separators
each separated by a zero-length string. It can be either false or
true.
>>> list(big.multisplit('apple=?banana?=?cookie', ('=', '?')))
['apple', 'banana', 'cookie']
>>> list(big.multisplit('apple=?banana?=?cookie', ('=', '?'), separate=False))
['apple', 'banana', 'cookie']
>>> list(big.multisplit('apple=?banana?=?cookie', ('=', '?'), separate=True))
['apple', '', 'banana', '', '', 'cookie']
If separate
and keep
are both true values, and your string
has multiple adjacent separators, multisplit
will view s
as having zero-length non-separator strings between the
adjacent separators:
>>> list(big.multisplit('appleXYbananaYXYcookie', ('X', 'Y'), separate=True, keep=True))
['appleX', 'Y', 'bananaY', 'X', 'Y', 'cookie']
>>> list(big.multisplit('appleXYbananaYXYcookie', ('X', 'Y'), separate=True, keep=big.AS_PAIRS))
[('apple', 'X'), ('', 'Y'), ('banana', 'Y'), ('', 'X'), ('', 'Y'), ('cookie', '')]
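If it helps, you can think of separate as roughly the difference between these two regular expressions. This is only a standard-library analogy for the default keep=False case, not a description of how multisplit works internally.

```python
import re

text = 'apple=?banana?=?cookie'

# '+' groups a run of adjacent separators into one, like separate=False.
grouped = re.split('[=?]+', text)
print(grouped)    # ['apple', 'banana', 'cookie']

# Without '+', every separator character splits on its own, like
# separate=True, leaving empty strings between adjacent separators.
separated = re.split('[=?]', text)
print(separated)  # ['apple', '', 'banana', '', '', 'cookie']
```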
strip
-
strip
indicates whether multisplit should strip separators from
the beginning and/or end of s
. It supports five values:
false, true, big.LEFT
, big.RIGHT
, and big.PROGRESSIVE
.
By default, strip
is false, which means it doesn't strip any
leading or trailing separators:
>>> list(big.multisplit('%|apple%banana|cookie|%|', ('%', '|')))
['', 'apple', 'banana', 'cookie', '']
Setting strip
to true strips both leading and trailing separators:
>>> list(big.multisplit('%|apple%banana|cookie|%|', ('%', '|'), strip=True))
['apple', 'banana', 'cookie']
big.LEFT
and big.RIGHT
tell multisplit
to only strip on that
side of the string:
>>> list(big.multisplit('.?apple.banana?cookie.?.', ('.', '?'), strip=big.LEFT))
['apple', 'banana', 'cookie', '']
>>> list(big.multisplit('.?apple.banana?cookie.?.', ('.', '?'), strip=big.RIGHT))
['', 'apple', 'banana', 'cookie']
big.PROGRESSIVE
duplicates a specific behavior of str.split
when using
maxsplit
. It always strips on the left, but it only strips on the right
if the string is completely split. If maxsplit
is reached before the entire
string is split, and strip
is big.PROGRESSIVE
, multisplit
won't strip
the right side of the string. Note in this example how the trailing separator
_
isn't stripped from the input string when maxsplit
is less than 3
.
>>> list(big.multisplit('^apple^banana_cookie_', ('^', '_'), strip=big.PROGRESSIVE))
['apple', 'banana', 'cookie']
>>> list(big.multisplit('^apple^banana_cookie_', ('^', '_'), maxsplit=0, strip=big.PROGRESSIVE))
['apple^banana_cookie_']
>>> list(big.multisplit('^apple^banana_cookie_', ('^', '_'), maxsplit=1, strip=big.PROGRESSIVE))
['apple', 'banana_cookie_']
>>> list(big.multisplit('^apple^banana_cookie_', ('^', '_'), maxsplit=2, strip=big.PROGRESSIVE))
['apple', 'banana', 'cookie_']
>>> list(big.multisplit('^apple^banana_cookie_', ('^', '_'), maxsplit=3, strip=big.PROGRESSIVE))
['apple', 'banana', 'cookie']
>>> list(big.multisplit('^apple^banana_cookie_', ('^', '_'), maxsplit=4, strip=big.PROGRESSIVE))
['apple', 'banana', 'cookie']
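Here's the str.split behavior that big.PROGRESSIVE is modeled on, shown with plain Python and a made-up sample string. When splitting on whitespace, the left side is always stripped, but the trailing whitespace only disappears once the string has been completely split.

```python
text = ' a b c '

# maxsplit reached early: the remainder keeps its trailing whitespace.
print(text.split(None, 0))  # ['a b c ']
print(text.split(None, 1))  # ['a', 'b c ']
print(text.split(None, 2))  # ['a', 'b', 'c ']

# The string is fully split: the trailing whitespace is gone.
print(text.split(None, 3))  # ['a', 'b', 'c']
print(text.split())         # ['a', 'b', 'c']
```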
reverse
-
reverse
specifies where multisplit
starts parsing the string--from
the beginning, or the end--and in what direction it moves when parsing
the string--towards the end, or towards the beginning. It only supports
two values: when it's false, multisplit
starts at the beginning of the
string, and parses moving to the right (towards the end of the string).
But when reverse
is true, multisplit
starts at the end of the
string, and parses moving to the left (towards the beginning
of the string).
This has two noticeable effects on multisplit
's output. First, this
changes which splits are kept when maxsplit
is less than the total number
of splits in the string. When reverse
is true, the splits are counted
starting on the right and moving towards the left:
>>> list(big.multisplit('apple-banana|cookie', ('-', '|'), reverse=True))
['apple', 'banana', 'cookie']
>>> list(big.multisplit('apple-banana|cookie', ('-', '|'), maxsplit=0, reverse=True))
['apple-banana|cookie']
>>> list(big.multisplit('apple-banana|cookie', ('-', '|'), maxsplit=1, reverse=True))
['apple-banana', 'cookie']
>>> list(big.multisplit('apple-banana|cookie', ('-', '|'), maxsplit=2, reverse=True))
['apple', 'banana', 'cookie']
>>> list(big.multisplit('apple-banana|cookie', ('-', '|'), maxsplit=3, reverse=True))
['apple', 'banana', 'cookie']
The second effect is far more subtle. It's only relevant when splitting strings
containing multiple overlapping separators. When reverse
is false, and there
are two (or more) overlapping separators, the string is split by the leftmost
overlapping separator. When reverse
is true, and there are two (or more)
overlapping separators, the string is split by the rightmost overlapping
separator.
Consider these two calls to multisplit
. The only difference between them is
the value of reverse
. They produce different results, even though neither
one uses maxsplit
.
>>> list(big.multisplit('appleXYZbananaXYZcookie', ('XY', 'YZ')))
['apple', 'Zbanana', 'Zcookie']
>>> list(big.multisplit('appleXYZbananaXYZcookie', ('XY', 'YZ'), reverse=True))
['appleX', 'bananaX', 'cookie']
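To see where those two results come from, here's a simplified sketch of a single split on a shorter string, using only str.find and str.rfind. It is not how multisplit is implemented, and it assumes both separators actually occur in the string; it just illustrates "leftmost match wins" versus "rightmost match wins".

```python
text = 'appleXYZbanana'
separators = ('XY', 'YZ')

# Scanning left to right: the separator that starts earliest wins.
start, sep = min((text.find(sep), sep) for sep in separators)
forward = [text[:start], text[start + len(sep):]]
print(forward)   # ['apple', 'Zbanana']

# Scanning right to left: the separator that ends latest wins.
end, sep = max((text.rfind(sep) + len(sep), sep) for sep in separators)
backward = [text[:end - len(sep)], text[end:]]
print(backward)  # ['appleX', 'banana']
```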
Reimplementing library functions using multisplit
Here are some examples of how you could use
multisplit
to replace some common Python string splitting methods. These exactly duplicate the
behavior of the originals.
import big

def _multisplit_to_split(s, sep, maxsplit, reverse):
    # An explicit separator means "every separator splits" mode.
    separate = sep is not None
    if separate:
        # Wrap sep in a tuple so it's treated as one separator string,
        # not as an iterable of single-character separators.
        separators = (sep,)
        strip = False
    else:
        separators = big.ascii_whitespace if isinstance(s, bytes) else big.whitespace
        strip = big.PROGRESSIVE
    result = list(big.multisplit(s, separators,
        maxsplit=maxsplit, reverse=reverse,
        separate=separate, strip=strip))
    if not separate:
        # Splitting on whitespace never returns a trailing empty string:
        # ''.split() == '   '.split() == []
        if result and (not result[-1]):
            result.pop()
    return result

def str_split(s, sep=None, maxsplit=-1):
    return _multisplit_to_split(s, sep, maxsplit, False)

def str_rsplit(s, sep=None, maxsplit=-1):
    return _multisplit_to_split(s, sep, maxsplit, True)

def str_splitlines(s, keepends=False):
    linebreaks = big.ascii_linebreaks if isinstance(s, bytes) else big.linebreaks
    l = list(big.multisplit(s, linebreaks,
        keep=keepends, separate=True, strip=False))
    # str.splitlines doesn't emit a final empty line.
    if l and not l[-1]:
        l.pop()
    return l

def _partition_to_multisplit(s, sep, reverse):
    if not sep:
        raise ValueError("empty separator")
    l = tuple(big.multisplit(s, (sep,),
        keep=big.ALTERNATING, maxsplit=1, reverse=reverse, separate=True))
    if len(l) == 1:
        # Separator not found: pad with two empty strings,
        # on the left for rpartition and on the right for partition.
        empty = b'' if isinstance(s, bytes) else ''
        if reverse:
            l = (empty, empty) + l
        else:
            l = l + (empty, empty)
    return l

def str_partition(s, sep):
    return _partition_to_multisplit(s, sep, False)

def str_rpartition(s, sep):
    return _partition_to_multisplit(s, sep, True)
You wouldn't want to use these, of course--Python's built-in
functions are so much faster!
Why do you sometimes get empty strings when you split?
Sometimes when you split using
multisplit
,
you'll get empty strings in the return value. This might be unexpected,
violating the Principle Of Least Astonishment.
But there are excellent reasons for this behavior.
Let's start by observing what str.split
does. str.split
really has two
major modes of operation: when you don't pass in a separator (or pass in None
for the
separator), and when you pass in an explicit separator string. In this latter mode,
the documentation says it regards every instance of a separator string as an individual
separator splitting the string. What does that mean? Watch what happens when you have
two adjacent separators in the string you're splitting:
>>> '1,2,,3'.split(',')
['1', '2', '', '3']
What's that empty string doing between '2'
and '3'
? Here's how you should think about it:
when you pass in an explicit separator, str.split
splits at every occurrence of that
separator in the string. It always splits the string in two, wherever there's
a separator. And when there are two adjacent separators, conceptually, they have a
zero-length string in between them:
>>> '1,2,,3'[4:4]
''
The empty string in the output of str.split
represents the fact that there
were two adjacent separators. If str.split
didn't add that empty string,
the output would look like this:
['1', '2', '3']
But then it'd be indistinguishable from splitting the same string without
two separators in a row:
>>> '1,2,3'.split(',')
['1', '2', '3']
This difference is crucial when you want to reconstruct the original string from
the split list. str.split
with a separator should always be reversible using
str.join
, and with that empty string there it works correctly:
>>> ','.join(['1', '2', '3'])
'1,2,3'
>>> ','.join(['1', '2', '', '3'])
'1,2,,3'
Now take a look at what happens when the string
you're splitting starts or ends with a separator:
>>> ',1,2,3,'.split(',')
['', '1', '2', '3', '']
This might seem weird. But, just like with two adjacent separators,
this behavior is important for consistency. Conceptually there's
a zero-length string between the beginning of the string and the first
comma. And str.join
needs those empty strings in order to correctly
recreate the original string.
>>> ','.join(['', '1', '2', '3', ''])
',1,2,3,'
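You can check this round-trip property yourself with a few lines of plain Python. The sample strings below are just the ones from this section plus a couple of edge cases; the last part shows that splitting on whitespace, by contrast, throws information away.

```python
samples = ['1,2,3', '1,2,,3', ',1,2,3,', ',', '']

# With an explicit separator, split followed by join is lossless.
round_trips = [','.join(sample.split(',')) == sample for sample in samples]
print(all(round_trips))  # True

# Splitting on whitespace groups and strips separators, so the
# original spacing can't be recovered afterwards.
lossy = ' '.join(' 1  2 '.split())
print(repr(lossy))       # '1 2'
```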
Naturally,
multisplit
lets you duplicate this behavior. When you want
multisplit
to behave just like str.split
does with an explicit separator
string, just pass in keep=False
, separate=True
, and strip=False
.
That is, if a
and b
are strings,
big.multisplit(a, (b,), keep=False, separate=True, strip=False)
always produces the same output as
a.split(b)
For example, here's
multisplit
splitting the strings we've been playing with, using these parameters:
>>> list(big.multisplit('1,2,,3', (',',), keep=False, separate=True, strip=False))
['1', '2', '', '3']
>>> list(big.multisplit(',1,2,3,', (',',), keep=False, separate=True, strip=False))
['', '1', '2', '3', '']
This "emit an empty string" behavior also has ramifications when keep
isn't false.
The behavior of keep=True
is easy to predict; multisplit
just appends the separators
to the previous string segment:
>>> list(big.multisplit('1,2,,3', (',',), keep=True, separate=True, strip=False))
['1,', '2,', ',', '3']
>>> list(big.multisplit(',1,2,3,', (',',), keep=True, separate=True, strip=False))
[',', '1,', '2,', '3,', '']
The principle here is that, when you use keep=True
, you should be able to reconstitute
the original string with ''.join
:
>>> ''.join(['1,', '2,', ',', '3'])
'1,2,,3'
>>> ''.join([',', '1,', '2,', '3,', ''])
',1,2,3,'
keep=big.ALTERNATING
is much the same, except we insert the separators as their
own segments, rather than appending each one to the previous segment:
>>> list(big.multisplit('1,2,,3', (',',), keep=big.ALTERNATING, separate=True, strip=False))
['1', ',', '2', ',', '', ',', '3']
>>> list(big.multisplit(',1,2,3,', (',',), keep=big.ALTERNATING, separate=True, strip=False))
['', ',', '1', ',', '2', ',', '3', ',', '']
Remember, ALTERNATING
output always begins and ends with a non-separator string.
If the string you're splitting begins or ends with a separator, the output
from multisplit
specifying keep=ALTERNATING
will correspondingly begin or end
with an empty string.
And, as with keep=True
, you can also recreate the original string by passing
these arrays in to ''.join
:
>>> ''.join(['1', ',', '2', ',', '', ',', '3'])
'1,2,,3'
>>> ''.join(['', ',', '1', ',', '2', ',', '3', ',', ''])
',1,2,3,'
Finally there's keep=big.AS_PAIRS
. The behavior here seemed so strange,
initially I thought it was wrong. But I've given it a lot of thought, and
I've convinced myself that this is correct:
>>> list(big.multisplit('1,2,,3', (',',), keep=big.AS_PAIRS, separate=True, strip=False))
[('1', ','), ('2', ','), ('', ','), ('3', '')]
>>> list(big.multisplit(',1,2,3,', (',',), keep=big.AS_PAIRS, separate=True, strip=False))
[('', ','), ('1', ','), ('2', ','), ('3', ','), ('', '')]
That tuple at the end, just containing two empty strings:
('', '')
It's so strange. How can that be right?
In short, it's similar to the str.split
situation.
When called with keep=AS_PAIRS
,
multisplit
guarantees that the final tuple will contain an empty separator string.
If the string you're splitting ends with a separator, it must emit
the empty non-separator string, followed by the empty separator string.
Think of it this way: with the tuple of empty strings there, you can easily
convert one keep
format into any other. (Provided that you know
what the separators were--either the source keep
format was not false,
or you only used one separator string when calling multisplit
).
Without that tuple of empty strings at the end, you'd also have to have an
if
statement to add or remove empty stuff from the end.
I'll demonstrate this with a simple example. Here's the output of
multisplit
splitting the string '1a1z1'
by the separator '1'
,
in each of the four keep
formats:
>>> list(big.multisplit('1a1z1', '1', keep=False))
['', 'a', 'z', '']
>>> list(big.multisplit('1a1z1', '1', keep=True))
['1', 'a1', 'z1', '']
>>> list(big.multisplit('1a1z1', '1', keep=big.ALTERNATING))
['', '1', 'a', '1', 'z', '1', '']
>>> list(big.multisplit('1a1z1', '1', keep=big.AS_PAIRS))
[('', '1'), ('a', '1'), ('z', '1'), ('', '')]
Because the AS_PAIRS
output ends with that tuple of empty
strings, we can mechanically convert it into any of the other
formats, like so:
>>> result = list(big.multisplit('1a1z1', '1', keep=big.AS_PAIRS))
>>> result
[('', '1'), ('a', '1'), ('z', '1'), ('', '')]
>>> [s[0] for s in result]
['', 'a', 'z', '']
>>> [s[0]+s[1] for s in result]
['1', 'a1', 'z1', '']
>>> [s for t in result for s in t][:-1]
['', '1', 'a', '1', 'z', '1', '']
If the AS_PAIRS
output didn't end with that tuple of empty strings,
you'd need to add an if
statement to restore the trailing empty
strings as needed.
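To see what would go wrong, here's a plain-Python sketch that starts from the AS_PAIRS output shown above and chops off the final tuple by hand. The truncated list is hypothetical; multisplit never produces it. The same conversions now silently lose the trailing empty string.

```python
pairs = [('', '1'), ('a', '1'), ('z', '1'), ('', '')]

# Hypothetical output if the final ('', '') tuple were omitted.
truncated = pairs[:-1]

# The keep=False conversion is now missing its trailing ''.
as_false = [p[0] for p in truncated]
print(as_false)  # ['', 'a', 'z']

# So is the keep=True conversion.
as_true = [p[0] + p[1] for p in truncated]
print(as_true)   # ['1', 'a1', 'z1']

# With the full list, both conversions come out right.
print([p[0] for p in pairs])         # ['', 'a', 'z', '']
print([p[0] + p[1] for p in pairs])  # ['1', 'a1', 'z1', '']
```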
Other differences between multisplit and str.split
str.split
returns an empty list when you split an
empty string by whitespace:
>>> ''.split()
[]
But not when you split by an explicit separator:
>>> ''.split('x')
['']
multisplit
is consistent here. If you split an empty string, it always yields a single empty string,
as long as the separators are valid:
>>> list(big.multisplit(''))
['']
>>> list(big.multisplit('', ('a', 'b', 'c')))
['']
Similarly, when splitting a string that only contains whitespace, str.split
also
returns an empty list:
>>> ' '.split()
[]
This is really the same as "splitting an empty string", because when str.split
splits on whitespace, the first thing it does is strip leading whitespace.
If you multisplit
a string that only contains whitespace, and you split on whitespace characters,
it returns two empty strings:
>>> list(big.multisplit(' '))
['', '']
This is because the string conceptually starts with a zero-length string,
then has a run of whitespace characters, then ends with another zero-length
string. So those two empty strings are the leading and trailing zero-length
strings, separated by whitespace. If you tell
multisplit
to also strip the string, you'll get back a single empty string:
>>> list(big.multisplit(' ', strip=True))
['']
And
multisplit
behaves consistently even when you use different separators:
>>> list(big.multisplit('ababa', 'ab'))
['', '']
>>> list(big.multisplit('ababa', 'ab', strip=True))
['']
[1] And I should know--multisplit
is implemented using re.split
!