
The goal of sre_yield is to efficiently generate all values that can match a
given regular expression, or to count the possible matches. It uses the
parsed regular expression, so you get a much more accurate result than
trying to just split strings.

.. code-block:: pycon

    >>> s = 'foo|ba[rz]'
    >>> s.split('|')  # bad
    ['foo', 'ba[rz]']
    >>> import sre_yield
    >>> list(sre_yield.AllStrings(s))  # better
    ['foo', 'bar', 'baz']

It does this by walking the tree constructed by sre_parse (the same parser
used internally by the re module), and building chained/repeating iterators
as appropriate. Depending on your input pattern, there may be duplicate
results -- these are cases that sre_parse did not optimize away.

.. code-block:: pycon

    >>> import sre_yield
    >>> list(sre_yield.AllStrings('.|a', charset='ab'))
    ['a', 'b', 'a']

...and happens in simpler cases too:

.. code-block:: pycon

    >>> list(sre_yield.AllStrings('a|a'))
    ['a', 'a']

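
To see the tree that gets walked, you can invoke the stdlib parser directly. Here is a minimal stdlib-only sketch (note that sre_parse is a deprecated alias of the private re._parser since Python 3.11, but remains importable):

```python
import sre_parse  # the stdlib parser behind re; deprecated alias of re._parser since 3.11

# Parse the alternation from the first example. The result is a SubPattern:
# a sequence of (opcode, argument) nodes, which is what sre_yield walks.
tree = sre_parse.parse('foo|ba[rz]')
for op, args in tree:
    # The whole alternation is a single BRANCH node with two sub-branches.
    print(repr(op))  # prints BRANCH
```

The single top-level BRANCH node is what becomes a chained iterator over the two alternatives.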
The membership check, 'abc' in values_obj, is by necessity a fullmatch -- it
must cover the entire string; imagine that the pattern has ^(...)$ around it.
Because re.search can match anywhere in an arbitrary string, emulating it
would produce a large number of junk matches -- probably not what you
want. (If that is what you want, add a .* on either side.)
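
The contrast is easy to check against the re module alone; a small stdlib-only sketch of the two semantics:

```python
import re

pattern = 'foo|ba[rz]'

# Membership in a values object behaves like re.fullmatch:
# the pattern must cover the entire candidate string.
assert re.fullmatch(pattern, 'bar') is not None   # in the language
assert re.fullmatch(pattern, 'xbarx') is None     # not covered entirely

# re.search, by contrast, will happily match in the middle of a string.
assert re.search(pattern, 'xbarx') is not None
```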
Here's a quick example, using the presidents regex from http://xkcd.com/1313/:

.. code-block:: pycon

    >>> s = 'bu|[rn]t|[coy]e|[mtg]a|j|iso|n[hl]|[ae]d|lev|sh|[lnd]i|[po]o|ls'
    >>> import re
    >>> re.search(s, 'kennedy') is not None  # note .search
    True
    >>> v = sre_yield.AllStrings(s)
    >>> len(v)
    23
    >>> 'bu' in v
    True
    >>> v[:5]
    ['bu', 'rt', 'nt', 'ce', 'oe']

If you do want to emulate search, you end up with a large number of matches quickly. Limiting the repetition a bit helps, but it's still a very large number.

.. code-block:: pycon

    >>> v2 = sre_yield.AllStrings('.{,30}(' + s + ').{,30}')
    >>> el = v2.__len__()  # too large for len(), which would raise OverflowError
    >>> print(el)
    57220492262913872576843611006974799576789176661653180757625052079917448874638816841926032487457234703154759402702651149752815320219511292208238103
    >>> 'kennedy' in v2
    True

If you're interested in extracting what would match during generation of a value, you can use AllMatches instead to get Match objects.

.. code-block:: pycon

    >>> v = sre_yield.AllMatches(r'a(\d)b')
    >>> m = v[0]
    >>> m.group(0)
    'a0b'
    >>> m.group(1)
    '0'

This even works for simplistic backreferences -- in this case, to get matching quotes.

.. code-block:: pycon

    >>> v = sre_yield.AllMatches(r'(["\'])([01]{3})\1')
    >>> m = v[0]
    >>> m.group(0)
    '"000"'
    >>> m.groups()
    ('"', '000')
    >>> m.group(1)
    '"'
    >>> m.group(2)
    '000'

We welcome bug reports -- check the `issue tracker on GitHub <https://github.com/google/sre_yield/issues>`_ to see if an issue has been reported before.
If you'd like to discuss anything, we have a `Google Group <https://groups.google.com/group/sre_yield>`_ as well.
We're aware of three similar modules, but each has a different goal.
Xeger was `originally written in Java <https://code.google.com/p/xeger/>`_ and
`ported to Python <https://bitbucket.org/leapfrogdevelopment/rstr>`_. It
generates random entries, which may suffice if you only want a few
matching values. This module and xeger differ statistically in the way
they handle repetitions:

.. code-block:: pycon

    >>> import random
    >>> v = sre_yield.AllStrings('[abc]{1,4}')
    >>> len(v)
    120

    >>> # Now random.choice(v) has a 3/120 chance of choosing a single letter.
    >>> len([x for x in v if len(x) == 1])
    3

    # xeger(v) has ~25% chance of choosing a single letter, because the
    # length and match are chosen independently.
    # (This doesn't run, so the doctests don't require xeger.)
    > from rstr import xeger
    > sum([1 if len(xeger('[abc]{1,4}')) == 1 else 0 for _ in range(120)])
    26

In addition, xeger differs in that '.' matches printable characters by
default (behavior you can get in sre_yield by setting
charset=string.printable, if desired).
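
As a point of reference, string.printable is a 100-character set, so under that charset a single '.' expands to 100 strings; a stdlib-only check of what it contains:

```python
import string

# The charset xeger uses for '.' is roughly the printable characters;
# passing charset=string.printable to sre_yield.AllStrings mimics that.
# 10 digits + 52 letters + 32 punctuation + 6 whitespace = 100 characters.
print(len(string.printable))  # prints 100
```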
Another module that walks sre_parse's tree is sre_dump, although it does not
generate matches; it only reconstructs the string pattern (useful primarily
if you hand-generate a tree). If you're interested in the space, it's a good
read: http://www.dalkescientific.com/Python/sre_dump.html
A third script can find matches by using randomization, so it sort of handles
anchors. It makes no guarantees, but it's another good look at the internals:
http://web.archive.org/web/20071024164712/http://www.uselesspython.com/jpetkau1.py
(a slightly older version appears in the announcement on `python-list <https://mail.python.org/pipermail/python-list/2001-August/104757.html>`_).
There are certainly valid regular expressions that sre_yield does not
handle. These include things like lookarounds and complex backreferences, as
well as a few other exceptions:
The maximum value for repeats is system-dependent -- in CPython's sre module
there's a special value which is treated as infinite (either 2**16-1 or
2**32-1, depending on the build). In sre_yield, this is taken as a literal
rather than infinite; thus (on a 2**16-1 platform):

.. code-block:: pycon

    >>> len(sre_yield.AllStrings('a*')[-1])
    65535
    >>> import re
    >>> len(re.match('.*', 'a' * 100000).group(0))
    100000

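
The sentinel behind this can be inspected with the stdlib parser; a small sketch, assuming CPython's sre_parse/sre_constants internals:

```python
import sre_parse
import sre_constants

# 'a*' parses to a single MAX_REPEAT node; its upper bound is the
# MAXREPEAT sentinel, which re treats as infinite and sre_yield
# treats as a literal cap.
op, (lo, hi, subpattern) = sre_parse.parse('a*')[0]
print(repr(op), lo, repr(hi))  # prints: MAX_REPEAT 0 MAXREPEAT
assert hi == sre_constants.MAXREPEAT
```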
The `re module docs <http://docs.python.org/2/library/re.html#regular-expression-syntax>`_
say "Regular expression pattern strings may not contain null bytes",
yet this appears to work fine.
The order of results does not depend on greediness.
The regex is treated as fullmatch.
sre_yield is confused by complex uses of anchors, but supports simple ones:

.. code-block:: pycon

    >>> list(sre_yield.AllStrings('foo$'))
    ['foo']
    >>> list(sre_yield.AllStrings('^$'))
    ['']
    >>> list(sre_yield.AllStrings('.\\b.'))  # doctest: +IGNORE_EXCEPTION_DETAIL
    Traceback (most recent call last):
    ...
    ParseError: Non-end-anchor None found at END state
