- Build and compose maintainable regular exprssions in JavaScript.
- ReDOS, begone!
Highlights
Fixing ReDOS
import {atomic, sequence} from 'compose-regexp'
const ReDOS = /^(([a-z])+.)+[A-Z]([a-z])+$/
const fixed = sequence(/^/, atomic(/(([a-z])+.)+/), /[A-Z]([a-z])+$/)
You can test it on flems.io.
Combining character sets
import {bound, charSet, flags, suffix} from 'compose-regexp'
const LcGrekLetter = charSet.intersection(/\p{Lowercase}/u, /\p{Script=Greek}/u)
LcGrekLetter.test("Γ")
LcGrekLetter.test("γ")
LcGrekLetter.test("x")
const b = bound(/\p{Script=Greek}/u)
const LcGrekWords = flags.add('g', b, suffix("+", LcGrekLetter), b)
for (lc of `Θεωρείται ως ο σημαντικότερος θεμελιωτής ...`.matchAll(LcGrekWords)) {
console.log(lc)
}
You can test it live on flems.io.
TOC
Why you may need this lib
Regular expressions don't do justice to regular grammars.
Complex RegExps hard to read, debug and modify. For code reuse, we often resort to source string concatenation which is error prone and even less readable than RegExp literals.
Also, the matching engines are, by spec, required to backtrack on match failures. This can enable surprising behavior, and even the ReDOS attack where maliciously crafted input triggers exponential backtracking, a heavy computation that can freeze programs for hours if not for years.
compose-regexp
to the rescue!
compose-regexp
will let you take advantage of the finely honed RegExp engines shipped in Browsers while making RegExps readable, testable, and ReDOS-proof witout deep knowledge of the engines.
It doesn't make regular grammars more powerful, they are still fundamentally limited, but since they are ubiquitous, we may as well have better tooling to put them to use...
compose-regexp
is reasonably small, and doesn't have dependencies. You can use it as a plain dependency, or, for client-side apps, in a server-side script that generates the RegExps that you ship to the browsers.
Usage
Installation
$ npm install --save compose-regexp
API
In a nutshell
either(...exprs)
sequence(...exprs)
suffix(quantifier, ...exprs)
maybe(...exprs)
flags.add(flags, ...exprs)
capture(...exprs)
namedCapture(label, ...exprs)
ref(nuberOrLabel)
ref(num, depth)
lookAhead(...exprs)
notAhead(...exprs)
lookBehind(...exprs)
notBehind(...exprs)
atomic(...exprs)
bound(...exprs)
noBound(...exprs)
charSet.difference(a, b)
charSet.intersection(a, b)
charSet.complement(a)
charSet.union(...cs)
Here's a flems.io sandbox with the full API pre-loaded, and the string example above.
General notes
-
The ...exprs
parameters of these functions can be either RegExps, strings, or arrays of expr
s. Arrays of exprs are treated as nested sequences.
-
Special characters in strings are escaped, so that '.*'
is equivalent to /\.\*/
.
Therefore:
> sequence('.', '*').source === '\\.\\*'
whereas:
> sequence(/./', /a/).source === '.a'
-
compose-regexp
understand RegExp syntax, and will add non-capturing groups automatically where relevant. e.g. suffix('*', '.', /\w+/)
will turn into /(?:\.\w+)*/
-
When composing RegExps with mixed flags:
- The
u
flag is contagious, and non-u
. RegExps will be upgraded if possible. - The other flags of regexps passed as parameters are ignored, and always reset to false on the result unless set by
flags()
. This is obviously suboptimal, and will be improved in time.
-
Back references (\1
, etc...) are automatically upgraded suc that sequence(/(\w)\1/, /(\d)\1/)
becomes /(\w)\1(\d)\2/
. The ref()
function lets one create refs programmatically:
> sequence(capture(), ref(1))
/()\1/
> sequence(capture, /\1/)
/()\2/
Basic combinators
either(...exprs) : RegExp
> either(/a/, /b/, /c/)
/a|b|c/
> either(
[/a/, /b/]
[/c/, /d/]
)
/ab|cd/
sequence(...exprs) : RegExp
> sequence(/a/, /b/, /c/)
/abc/
compose-regexp
inserts non-capturing groups where needed:
> sequence(/a/, /b|c/)
/a(?:b|c)/
suffix(quantifier, ...exprs) : RegExp
suffix(quantifier)(...exprs) : RegExp
Valid quantifiers:
greedy | non-greedy |
---|
? | ?? |
* | *? |
+ | +? |
{n} | {n}? |
{n,} | {n,}? |
{m,n} | {m,n}? |
non-string quantifiers are converted to String and wrapped in braces such that
suffix(3)
is equivalent to suffix('{3}')
suffix([1,3])
is equivalent to suffix('{1,3}')
suffix([2,,])
is equivalent to suffix('{2,}')
> suffix("*", either(/a/, /b/, /c/))
/(?:a|b|c)*/
> zeroOrMore = suffix('*')
> zeroOrMore('a')
/a*/
maybe(...exprs) : RegExp
shorcut for the ?
quantifier
> maybe('a')
/a?/
flags(opts, ...exprs) : RegExp
flags(opts)(...exprs) : RegExp
> flags('gm', /a/)
/a/gm
> global = flags(g); global('a')
/a/g
Captures and References
capture(...exprs) : RegExp
Create an anonymous capturing group.
> capture(/a/, /b/, /c/)
/(abc)/
namedCapture(...exprs) : RegExp
Create an named capturing group.
> capture(/a/, /b/, /c/)
/(abc)/
ref(label: string) : RegExp
ref(n: number) : RegExp (thunk)
ref(n: number, depth: number) : RegExp (thunk)
When passed a number, returns a specially crafted RegExp that can't match anything as is but will be turned into a back reference when composed.
See the back references section below for a detailed description
> ref("label")
/\k<label>/
> ref(1)
/$d:0,n:1^/
> sequence(capture(/./), ref(1))
/(.)\1/
> sequence(capture(/\w/), either(capture(ref(1, 2)), /./u))
/(\w)(?:(\1)|.)/u
Assertions
lookAhead(...exprs) : RegExp
Positive look ahead.
> lookAhead(/a/, /b/, /c/)
/(?=abc)/
notAhead(...exprs) : RegExp
Negative look ahead.
> notAhead(/a/, /b/, /c/)
/(?!abc)/
lookBehind(...exprs) : RegExp
Requires an engine that supports look behind assertions
Positive look behind.
> lookBehind(/a/, /b/, /c/)
/(?<=abc)/
notBehind(...exprs) : RegExp
Requires an engine that supports look behind assertions
Negative look behind.
> notBehind(/a/, /b/, /c/)
/(?<!abc)/
Derived Utilites
atomic(...exprs) : RegExp
Returns a RegExp that will match sequence(...exprs)
, but into which the engine won't backtrack once it has succeeded.
> atomic(/\w+?/)
/(?=(\w+?))\1/
> lookBehind(atomic(/\w+?/))
> lookBehind(()=>atomic(/\w+?/))
/(?<=\1(?<=(\w+?)))/
atomic()
adds an unnamed capturing group. There's no way around it as of until JS adds support for atomic groups. You'd be better off using named capturing groups if you want to extract sub-matches, they are easier the handle than match indices which go all over the place anyway when you compose RegExps. See the atomic matching section for more detals.
RegExps with captures have custom behavior when using someString.split(regexp)
, which probably isn't what you want (the captures are inserted into the resulting arrays, with undefined
for the capturing groups that didn't match). Avoid atomic()
for that purpose.
bound(...exprs) : RegExp
Requires an engine that supports look behind assertions
Returns a RegExp that works like /\b/
does, but for an arbitrary char set.
const numBound = flags.add('y', bound(/[0-9]/))
numBound.test("q88p")
numBound.lastIndex = 1
numBound.test("q88p")
numBound.lastIndex = 2
numBound.test("q88p")
numBound.lastIndex = 3
numBound.test("q88p")
numBound.lastIndex = 4
numBound.test("q88p")
noBound(...exprs) : RegExp
Requires an engine that supports look behind assertions
noBound(x)
returns a RegExp that succeeds where bounds(x)
fails, and vice-versa.
charSet.difference(a, b) : RegExp
charSet.intersection(a, b) : RegExp
charSet.complement(cs) : RegExp
charSet.union(...cs) : RegExp
Set operations on charSets... well, operations on arbitrary RegExps, actually. They can be fed anything but are probably most useful when used with CharSets, CharClasses, and Unicode properties.
charSet.difference(a, b)
: returns a RegExp that matches the characters matched by a
and don't match those of b
const ab = charSet.difference(/[a-d]/, /[cd]/)
ab.test(a)
ab.test(b)
ab.test(c)
ab.test(d)
charSet.intersection(a, b)
: returns a RegExp that matches characters matched by both a
and b
.
const bc = charSet.intersection(/[a-c]/, /[b-d]/)
bc.test(a)
bc.test(b)
bc.test(c)
bc.test(d)
This is especially useful when combined with Unicode properties:
const LcCyrl = charSet.intersection(/\p{Lowercase}/u, /\p{Script=Cyrillic}/u)
LcCyrl.test("б")
LcCyrl.test("Б")
LcCyrl.test("b")
const UcGrek = = charSet.intersection(/\p{Uppuercase}/u, /\p{Script=Greek}/u)
UcGrek.test("Γ")
UcGrek.test("γ")
UcGrek.test("W")
const asciiNonLetter = charSet.difference(/[\0\x7f]/, /[A-Za-z]/)
asciiNonLetter.test(":")
asciiNonLetter.test("a")
The full list of supported Unicode properties is listed in the ECMAScript spec.
charSet.complement(cs)
: Returns a RegExp that matches when the argument doesn't
const notDF = charSet.complement(/[D-F]/)
notDF.test("C")
notDF.test("DEF")
notDF.test("G")
charSet.union(...cs)
: returns a RegExp that matches any of the arguments
const abcd = charSet.union(/[ab]/, /c/, /d/)
abcd.test(a)
abcd.test(b)
abcd.test(c)
abcd.test(d)
abcd.test(e)
Atomic matching
Atomic groups prevent the RegExp engine from backtracking into them, aleviating the infamous ReDOS attack. JavaScript doesn't support them out of the box as of 2022, but they can be emulated using the /(?=(your stuff here))\1/
pattern. We provide a convenient atomic()
helper that wraps regexps in such a way, making them atomic at the boundary. Putting an atomic()
call around an exprssion is not enough to prevent backtracking inside of it, you'll have to put them around every exprssion that could backtrack problematically. (//TODO: give examples. Feel free to open a PR or an issue do discuss this).
Also, the atomic()
helper creates a capturing group, offsetting the indices of nested and further captures. It is better to rely on named captures for extracting values from a match, as numbered captures go all over the place when composing.
In look behind assertions (lookBehind(...)
and notBehind(...)
a.k.a. /(?<=...)/
and /(?<!...)/
) matching happens backwards. For atomic matching in lookBehind assertions, wrap the arguments inside a function, in that context, atomic('x')
produces /\1(?<=(x))/
.
lookBehind(atomic(/.*?/))
lookBehind(()=>atomic(/.*?/))
To better undestand the behavior of back-references in compound regexps, see the next section.
Let's talk about flags
The g
, d
and y
flags of input RegExps will be ignored by the combinators. The resulting RegExp will not have them (unless added manually with flags.add()
).
The u
flag is contagious when possible. E.G. sequence(/./u, /a/)
returns /.a/u
. However the meaning of sequence(/\p{Any}/u, /./)
is ambiguous. We don't know if you want /\p{Any}./u
, or /\p{Any}(?![\10000-\10ffff])./u
, avoiding matches in the Astral planes, as /./
would. In scenarios like this one, and in cases where a non-u
RegExp whose source is not allowed in u
mode is mixed with one that has a u
flag, an error is thrown.
RegExps with the m
and the s
flags are converted to flagless regexps that match the same input. So for example /./s
becomes /[^]/
. The pattern is a bit more involved for /^/
and /$/
with the m
flag. If your RegExp engine doesn't support look behind assertions, the m
flag is preserved and is handled like the i
flag (see below).
RegExps with the i
flag can't be mixed with i
-less RegExps, and vice-versa. You need an "all-i
" or an "all-non-i
" cast for a given composition (strings are fine in both cases, they are flag-agnostic).
Back References
Regular exprssions let one reference the value of a previous group by either numbered or labeled back-references. Labeled back-references refer to named groups, whereas the index of a numbered back references can map to either a named or an unnamed group.
/(.)\1/.test("aa")
/(.)\1/.test("ab")
/(?<x>.)\1/.test("aa")
/(?<x>.)\k<x>/.test("aa") // true
compose-regexp
takes back references into account when composing RegExps, and will update the indices of numbered refs.
sequence(/(.)\1/, /(.)\1/)
In look behind assertions, be they positive or negative, patterns are applied backward (i.e. right-to-left) to the input string. In that scenario, back references must appear before the capture they reference:
/(?<=\1(.))./.exec("abccd")
Patterns with back reference intended for general, forward usage will be useless in look behind context, and compose-regexp
will reject them. If you want to use back references in patterns that are interpreted backwards, you must use a function:
const errorInTheMaking = sequence(capture(/./), ref(1))
const bw = lookBehind(errorInTheMaking)
const bw = lookBehind(()=>[ref(1), capture(/./)])
The atomic()
helper is direction sensitive. When called in backward context, it produces a result that will be atomic when interpreted backwards.
atomic("a")
lookBehind(atomic(a))
lookBehind(()=>atomic(a))
For scenarios where you'd like to use a back reference in a nested context, you can use the second optional depth
parameter to ref(n, depth)
sequence(capture(/./), either(capture("a", ref(1, 2), "b"), /./u)
sequence(capture(/./), either(capture("a", ref(1), "b"), /./)
The depth is 2
, for the levels in the call stack(one for capture()
, one for either()
).
Browser support
This library is meant to work in ES5 environments, provided you use minimal polyfills (Object.assign()
, essentially), but it hasn't been tested in that environment. If you find bugs please send them my way. TODO: bundle the test suite for old IE testing, stripping out tests that can't work there.
Some of the library's features rely on newer RegExp features. The u
flag can't be pulled out of thin air, for example, and neither can look behind assertions.
Limitations and missing pieces
-
compose-regexp
will not be able to mix i
-flagged and non-i
-flagged without native support for the scenario. Case-insensitive matching involves case folding both the pattern and the source string, and compose-regexp
can't access the latter.
-
The logic to detect empty patterns only works for rudimentary ones (enough to deal with what we produce AFAICT). TODO: improve this
License MIT
The MIT License (MIT)
Copyright (c) 2016 Pierre-Yves Gérardy
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in
all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
THE SOFTWARE.