=========
sre_yield
Quick Start
The goal of sre_yield
is to efficiently generate all values that can match a
given regular expression, or count possible matches efficiently. It uses the
parsed regular expression, so you get a much more accurate result than trying
to just split strings.
.. code-block:: pycon
>>> s = 'foo|ba[rz]'
>>> s.split('|') # bad
['foo', 'ba[rz]']
>>> import sre_yield
>>> list(sre_yield.AllStrings(s)) # better
['foo', 'bar', 'baz']
It does this by walking the tree as constructed by sre_parse
(same thing
used internally by the re
module), and constructing chained/repeating
iterators as appropriate. There may be duplicate results, depending on your
input string though -- these are cases that sre_parse
did not optimize.
.. code-block:: pycon
>>> import sre_yield
>>> list(sre_yield.AllStrings('.|a', charset='ab'))
['a', 'b', 'a']
...and happens in simpler cases too:
.. code-block:: pycon
>>> list(sre_yield.AllStrings('a|a'))
['a', 'a']
Quirks
The membership check, 'abc' in values_obj
is by necessity fullmatch -- it
must cover the entire string. Imagine that it has ^(...)$
around it.
Because re.search
can match anywhere in an arbitrarily string, emulating
this would produce a large number of junk matches -- probably not what you
want. (If that is what you want, add a .*
on either side.)
Here's a quick example, using the presidents regex from http://xkcd.com/1313/
.. code-block:: pycon
>>> s = 'bu|[rn]t|[coy]e|[mtg]a|j|iso|n[hl]|[ae]d|lev|sh|[lnd]i|[po]o|ls'
>>> import re
>>> re.search(s, 'kennedy') is not None # note .search
True
>>> v = sre_yield.AllStrings(s)
>>> v.__len__()
23
>>> 'bu' in v
True
>>> v[:5]
['bu', 'rt', 'nt', 'ce', 'oe']
If you do want to emulate search, you end up with a large number of matches
quickly. Limiting the repetition a bit helps, but it's still a very large
number.
.. code-block:: pycon
>>> v2 = sre_yield.AllStrings('.{,30}(' + s + ').{,30}')
>>> el = v2.__len__() # too big for int
>>> print(str(el).rstrip('L'))
57220492262913872576843611006974799576789176661653180757625052079917448874638816841926032487457234703154759402702651149752815320219511292208238103
>>> 'kennedy' in v2
True
Capturing Groups
If you're interested in extracting what would match during generation of a
value, you can use AllMatches instead to get Match objects.
.. code-block:: pycon
>>> v = sre_yield.AllMatches(r'a(\d)b')
>>> m = v[0]
>>> m.group(0)
'a0b'
>>> m.group(1)
'0'
This even works for simplistic backreferences, in this case to have matching quotes.
.. code-block:: pycon
>>> v = sre_yield.AllMatches(r'(["\'])([01]{3})\1')
>>> m = v[0]
>>> m.group(0)
'"000"'
>>> m.groups()
('"', '000')
>>> m.group(1)
'"'
>>> m.group(2)
'000'
Reporting Bugs, etc.
We welcome bug reports -- see our issue tracker on GitHub <https://github.com/google/sre_yield/issues>
_ to see if it's been reported before.
If you'd like to discuss anything, we have a Google Group <https://groups.google.com/group/sre_yield>
_ as well.
Related Modules
We're aware of three similar modules, but each has a different goal.
xeger
Xeger was originally written in Java <https://code.google.com/p/xeger/>
_ and
ported to Python <https://bitbucket.org/leapfrogdevelopment/rstr>
_. This
generates random entries, which may suffice if you want to get just a few
matching values. This module and xeger
differ statistically in the way
they handle repetitions:
.. code-block:: pycon
>>> import random
>>> v = sre_yield.AllStrings('[abc]{1,4}')
>>> len(v)
120
# Now random.choice(v) has a 3/120 chance of choosing a single letter.
>>> len([x for x in v if len(x) == 1])
3
# xeger(v) has ~25% chance of choosing a single letter, because the length
and match are chosen independently.
# (This doesn't run, so the doctests don't require xeger)
> from rstr import xeger
> sum([1 if len(xeger('[abc]{1,4}')) == 1 else 0 for _ in range(120)])
26
In addition, xeger
differs in the default matching of '.'
is for
printable characters (which you can get by setting charset=string.printable
in sre_yield
, if desired).
sre_dump
Another module that walks sre_parse
's tree is sre_dump
, although it
does not generate matches, only reconstructs the string pattern (useful
primarily if you hand-generate a tree). If you're interested in the space,
it's a good read. http://www.dalkescientific.com/Python/sre_dump.html
jpetkau1
Can find matches by using randomization, so sort of handles anchors. Not
guaranteed though, but another good look at internals.
http://web.archive.org/web/20071024164712/http://www.uselesspython.com/jpetkau1.py
(and slightly older version in the announcement on python-list <https://mail.python.org/pipermail/python-list/2001-August/104757.html>
_).
Differences between sre_yield and the re module
There are certainly valid regular expressions which sre_yield
does not
handle. These include things like lookarounds, backreferences, but also a few
other exceptions:
-
The maximum value for repeats is system-dependant -- CPython's sre
module
there's a special value which is treated as infinite (either 216-1 or
232-1 depending on build). In sre_yield, this is taken as a literal,
rather than infinite, thus (on a 2**16-1 platform):
.. code-block:: pycon
>>> len(sre_yield.AllStrings('a*')[-1])
65535
>>> import re
>>> len(re.match('.*', 'a' * 100000).group(0))
100000
-
The re
module docs <http://docs.python.org/2/library/re.html#regular-expression-syntax>
_
say "Regular expression pattern strings may not contain null bytes"
yet this appears to work fine.
-
Order does not depend on greediness.
-
The regex is treated as fullmatch.
-
sre_yield
is confused by complex uses of anchors, but support simple ones:
.. code-block:: pycon
>>> list(sre_yield.AllStrings('foo$'))
['foo']
>>> list(sre_yield.AllStrings('^$'))
['']
>>> list(sre_yield.AllStrings('.\\b.')) # doctest: +IGNORE_EXCEPTION_DETAIL
Traceback (most recent call last):
...
ParseError: Non-end-anchor None found at END state