🚨 Shai-Hulud Strikes Again:834 Packages Compromised.Technical Analysis
Socket
Book a DemoInstallSign in
Socket

regex-automata

Package Overview
Dependencies
Maintainers
1
Versions
1
Alerts
File Explorer

Advanced tools

Socket logo

Install Socket

Detect and block malicious and high-risk dependencies

Install

regex-automata

Library for regular expressions using finite automata

pipPyPI
Version
0.3.1
Maintainers
1

CI - build CI - coverage MyPy & Ruffle checked

regex-automata

A toy implementation of regular expressions using finite automata. Its API is modeled after the standard re module, so it can be used as a drop-in replacement (not all re features are supported, though, and it behaves differently in edge cases).

Intended as a white-box implementation, it gives full tracing output for parsing and matching. Diagrams of abstract syntax tree and the automaton are also available.

Usage

import regex_automata as re
import logging

logging.basicConfig(level=logging.INFO)             # show verbose output

pattern = re.compile(r"(foo)*bar|baz")              # regex_automata.Pattern

m = pattern.fullmatch("foofoobar")                  # regex_automata.Match
pattern.fullmatch("foo")                            # None

pattern.ast                                         # regex_automata.parser.ast.AstNode
pattern.raw_ast                                     # regex_automata.parser.ast.AstNode
pattern.nfa                                         # regex_automata.automata.nfa.NFA

pattern.render_ast("regex_ast.svg")
pattern.render_ast("regex_ast_raw.svg", raw=True)
pattern.render_nfa("regex_nfa.svg")

pattern2 = re.compile(r"[a-z_-][a-z0-9_-]*", re.IGNORECASE)
pattern2.tokens
# [CharacterSet(span=(0, 7), text='[a-z_-]', set=RangeSet(((45, 46), (95, 96), (97, 123)))),
#  CharacterSet(span=(7, 17), text='[a-z0-9_-]', set=RangeSet(((45, 46), (48, 58), (95, 96), (97, 123)))),
#  Repetition(span=(17, 18), text='*', min=0, max=None)]

list(re.finditer(r"[0-9]{2,}", "123"))
# [<Match span=(0, 3), match='123'>]
list(re.finditer(r"[0-9]{2,}", "123", all_matches=True))
# [<Match span=(0, 3), match='123'>, <Match span=(0, 2), match='12'>, <Match span=(1, 3), match='23'>]

Abstract syntax tree of "(foo)*bar|baz" (ie. pattern.ast):

tree for (foo)*bar|baz

Finite automaton accepting "(foo)*bar|baz" (ie. pattern.nfa):

automaton for (foo)*bar|baz

Features compared to standard re module

regex-automata is generally compatible with re - features either work as intended or fail with regex_automata.errors.UnsupportedSyntaxError. In some edge cases, results differ: most notably when there are multiple greedy quantifiers next to each other.

regex-automata passes 299/305 re pattern tests, with additional 98 tests ignored due to testing unsupported features.

  • Library

    • match(), fullmatch(), search(), finditer(), sub(), subn() methods
    • Match object containing span, matched text and groups
    • flags DOTALL, IGNORECASE and MULTILINE
  • Syntax

    • character sets: ., [...] (special sequences such as \w are supported, but not inside square brackets)
    • repetition: *, ?, +, {n,k}
    • boundary assertions: ^, $, \b, \B, \A, \Z
    • groups: (...), (?:...), (?P<name>...)
    • inline flags eg. (?i)
    • comments (?#...)
  • Notable features that are not supported by this library but are in standard re:

    • backreferences in patterns (\1, \g<1> etc.)
    • lookahead/lookbehind assertions ((?=...), (?!...), (?<=...), (?<!...))
    • non-greedy and possessive repetition (*?, *+, etc.)
    • bytes support
    • UNICODE flag, non-ASCII meaning for \d etc.

Implementation overview

  • Input pattern is tokenized via regex_automata.parser.tokenizer.Tokenizer
    • Characters and sets are represented with regex_automata.automata.rangeset.RangeSet
  • List of tokens is processed by recursive descent parser regex_automata.parser.parser.Parser
  • Parser produces "raw" abstract syntax tree composed of regex_automata.parser.ast.AstNode nodes
  • AST is processed with regex_automata.parser.ast_processor.ASTProcessor to produce the final tree
    • This is used to replace fancy repetition with primitives (union, concatenation, iteration)
  • Epsilon-free NFA is recursively constructed from the AST using regex_automata.regex.nfa_builder.NFABuilder
  • The processed pattern is stored in regex_automata.regex.pattern.Pattern, which is the high-level interface
  • When processing input text, the text and NFA are passed to regex_automata.regex.nfa_evaluator.NFAEvaluator
  • The evaluator produces regex_automata.regex.match.Match objects

Grammar

The recursive descent parser uses the following LL(1) grammar:

 1.  E  → F E'
 2.  E' → | E
 3.  E' → ε
 4.  F  → G F'
 5.  F' → G F'
 6.  F' → ε
 7.  G  → H G'
 8.  G' → *
 9.  G' → ε
10.  H  → ( E )
11.  H  → a                     (a character or character set)
12.  E  → ε
13.  F  → ε
14.  G  → boundary_assertion    (one of: ^, $, \A, \Z, \b, \B)

Which is derived from the following CFG:

E → F | E
E → F
E → ε
F → G F
F → G
G → H *
G → H
G → boundary_assertion
H → ( E )
H → a

Which is derived from the following CFG:

E → E | E
E → E E
E → E *
E → ( E )
E → a
E → boundary_assertion
E → ε

License

MIT, see LICENSE.txt.

Keywords

finite automata

FAQs

Did you know?

Socket

Socket for GitHub automatically highlights issues in each pull request and monitors the health of all your open source dependencies. Discover the contents of your packages and block harmful activity before you install or update your dependencies.

Install

Related posts