
regex-automata
A toy implementation of regular expressions using finite automata. Its API is modeled after
the standard re module, so it can be used as a drop-in replacement (not all re features
are supported, though, and it behaves differently in edge cases).
Intended as a white-box implementation, it gives full tracing output for parsing and matching.
Diagrams of abstract syntax tree and the automaton are also available.
Usage
import regex_automata as re
import logging
logging.basicConfig(level=logging.INFO)
pattern = re.compile(r"(foo)*bar|baz")
m = pattern.fullmatch("foofoobar")
pattern.fullmatch("foo")
pattern.ast
pattern.raw_ast
pattern.nfa
pattern.render_ast("regex_ast.svg")
pattern.render_ast("regex_ast_raw.svg", raw=True)
pattern.render_nfa("regex_nfa.svg")
pattern2 = re.compile(r"[a-z_-][a-z0-9_-]*", re.IGNORECASE)
pattern2.tokens
list(re.finditer(r"[0-9]{2,}", "123"))
list(re.finditer(r"[0-9]{2,}", "123", all_matches=True))
Abstract syntax tree of "(foo)*bar|baz" (ie. pattern.ast):

Finite automaton accepting "(foo)*bar|baz" (ie. pattern.nfa):

Features compared to standard re module
regex-automata is generally compatible with re - features either work as intended or
fail with regex_automata.errors.UnsupportedSyntaxError. In some edge cases, results differ:
most notably when there are multiple greedy quantifiers next to each other.
regex-automata passes 299/305 re pattern tests,
with additional 98 tests ignored due to testing unsupported features.
Implementation overview
- Input pattern is tokenized via
regex_automata.parser.tokenizer.Tokenizer
- Characters and sets are represented with
regex_automata.automata.rangeset.RangeSet
- List of tokens is processed by recursive descent parser
regex_automata.parser.parser.Parser
- Parser produces "raw" abstract syntax tree composed of
regex_automata.parser.ast.AstNode nodes
- AST is processed with
regex_automata.parser.ast_processor.ASTProcessor to produce the final tree
- This is used to replace fancy repetition with primitives (union, concatenation, iteration)
- Epsilon-free NFA is recursively constructed from the AST using
regex_automata.regex.nfa_builder.NFABuilder
- The processed pattern is stored in
regex_automata.regex.pattern.Pattern, which is the high-level interface
- When processing input text, the text and NFA are passed to
regex_automata.regex.nfa_evaluator.NFAEvaluator
- The evaluator produces
regex_automata.regex.match.Match objects
Grammar
The recursive descent parser uses the following LL(1) grammar:
1. E → F E'
2. E' → | E
3. E' → ε
4. F → G F'
5. F' → G F'
6. F' → ε
7. G → H G'
8. G' → *
9. G' → ε
10. H → ( E )
11. H → a (a character or character set)
12. E → ε
13. F → ε
14. G → boundary_assertion (one of: ^, $, \A, \Z, \b, \B)
Which is derived from the following CFG:
E → F | E
E → F
E → ε
F → G F
F → G
G → H *
G → H
G → boundary_assertion
H → ( E )
H → a
Which is derived from the following CFG:
E → E | E
E → E E
E → E *
E → ( E )
E → a
E → boundary_assertion
E → ε
License
MIT, see LICENSE.txt.