🚨 Shai-Hulud Strikes Again:834 Packages Compromised.Technical Analysis →

Book a Demo Install Sign in

regex-automata

Package Overview

Advanced tools

Install Socket

Detect and block malicious and high-risk dependencies

Install

regex-automata

Library for regular expressions using finite automata

PyPI

Version: 0.3.1

Maintainers: 1

regex-automata

A toy implementation of regular expressions using finite automata. Its API is modeled after the standard re module, so it can be used as a drop-in replacement (not all re features are supported, though, and it behaves differently in edge cases).

Intended as a white-box implementation, it gives full tracing output for parsing and matching. Diagrams of abstract syntax tree and the automaton are also available.

Usage

import regex_automata as re
import logging

logging.basicConfig(level=logging.INFO)             # show verbose output

pattern = re.compile(r"(foo)*bar|baz")              # regex_automata.Pattern

m = pattern.fullmatch("foofoobar")                  # regex_automata.Match
pattern.fullmatch("foo")                            # None

pattern.ast                                         # regex_automata.parser.ast.AstNode
pattern.raw_ast                                     # regex_automata.parser.ast.AstNode
pattern.nfa                                         # regex_automata.automata.nfa.NFA

pattern.render_ast("regex_ast.svg")
pattern.render_ast("regex_ast_raw.svg", raw=True)
pattern.render_nfa("regex_nfa.svg")

pattern2 = re.compile(r"[a-z_-][a-z0-9_-]*", re.IGNORECASE)
pattern2.tokens
# [CharacterSet(span=(0, 7), text='[a-z_-]', set=RangeSet(((45, 46), (95, 96), (97, 123)))),
#  CharacterSet(span=(7, 17), text='[a-z0-9_-]', set=RangeSet(((45, 46), (48, 58), (95, 96), (97, 123)))),
#  Repetition(span=(17, 18), text='*', min=0, max=None)]

list(re.finditer(r"[0-9]{2,}", "123"))
# [<Match span=(0, 3), match='123'>]
list(re.finditer(r"[0-9]{2,}", "123", all_matches=True))
# [<Match span=(0, 3), match='123'>, <Match span=(0, 2), match='12'>, <Match span=(1, 3), match='23'>]

Abstract syntax tree of "(foo)*bar|baz" (ie. pattern.ast):

tree for (foo)*bar|baz

Finite automaton accepting "(foo)*bar|baz" (ie. pattern.nfa):

automaton for (foo)*bar|baz

Features compared to standard `re` module

regex-automata is generally compatible with re - features either work as intended or fail with regex_automata.errors.UnsupportedSyntaxError. In some edge cases, results differ: most notably when there are multiple greedy quantifiers next to each other.

regex-automata passes 299/305 re pattern tests, with additional 98 tests ignored due to testing unsupported features.

Library
- match(), fullmatch(), search(), finditer(), sub(), subn() methods
- Match object containing span, matched text and groups
- flags DOTALL, IGNORECASE and MULTILINE
Syntax
- character sets: ., [...] (special sequences such as \w are supported, but not inside square brackets)
- repetition: *, ?, +, {n,k}
- boundary assertions: ^, $, \b, \B, \A, \Z
- groups: (...), (?:...), (?P<name>...)
- inline flags eg. (?i)
- comments (?#...)
Notable features that are not supported by this library but are in standard re:
- backreferences in patterns (\1, \g<1> etc.)
- lookahead/lookbehind assertions ((?=...), (?!...), (?<=...), (?<!...))
- non-greedy and possessive repetition (*?, *+, etc.)
- bytes support
- UNICODE flag, non-ASCII meaning for \d etc.

Implementation overview

Input pattern is tokenized via regex_automata.parser.tokenizer.Tokenizer
- Characters and sets are represented with regex_automata.automata.rangeset.RangeSet
List of tokens is processed by recursive descent parser regex_automata.parser.parser.Parser
Parser produces "raw" abstract syntax tree composed of regex_automata.parser.ast.AstNode nodes
AST is processed with regex_automata.parser.ast_processor.ASTProcessor to produce the final tree
- This is used to replace fancy repetition with primitives (union, concatenation, iteration)
Epsilon-free NFA is recursively constructed from the AST using regex_automata.regex.nfa_builder.NFABuilder
The processed pattern is stored in regex_automata.regex.pattern.Pattern, which is the high-level interface
When processing input text, the text and NFA are passed to regex_automata.regex.nfa_evaluator.NFAEvaluator
The evaluator produces regex_automata.regex.match.Match objects

Grammar

The recursive descent parser uses the following LL(1) grammar:

 1.  E  → F E'
 2.  E' → | E
 3.  E' → ε
 4.  F  → G F'
 5.  F' → G F'
 6.  F' → ε
 7.  G  → H G'
 8.  G' → *
 9.  G' → ε
10.  H  → ( E )
11.  H  → a                     (a character or character set)
12.  E  → ε
13.  F  → ε
14.  G  → boundary_assertion    (one of: ^, $, \A, \Z, \b, \B)

Which is derived from the following CFG:

E → F | E
E → F
E → ε
F → G F
F → G
G → H *
G → H
G → boundary_assertion
H → ( E )
H → a

Which is derived from the following CFG:

E → E | E
E → E E
E → E *
E → ( E )
E → a
E → boundary_assertion
E → ε

License

MIT, see LICENSE.txt.

Keywords

finite automata

regex

regular expressions

FAQs

What is regex-automata?

Is regex-automata well maintained?

Did you know?

Socket for GitHub automatically highlights issues in each pull request and monitors the health of all your open source dependencies. Discover the contents of your packages and block harmful activity before you install or update your dependencies.

Install