regexp-tree

Regular expressions processor in JavaScript
TL;DR: RegExp Tree is a regular expressions processor, which includes parser, traversal, transformer, optimizer, and interpreter APIs.
You can get an overview of the tool in this article.
Table of Contents
Installation
The parser can be installed as an npm module:
npm install -g regexp-tree
regexp-tree --help
You can also try it online using AST Explorer.
Development
- Fork https://github.com/DmitrySoshnikov/regexp-tree repo
- If there is an actual issue from the issues list you'd like to work on, feel free to assign it yourself, or comment on it to avoid collisions (open a new issue if needed)
- Make your changes
- Make sure
npm test still passes (add new tests if needed)
- Submit a PR
The regexp-tree parser is implemented as an automatic LR parser using Syntax tool. The parser module is generated from the regexp grammar, which is based on the regular expressions grammar used in ECMAScript.
For development from the github repository, run build command to generate the parser module:
git clone https://github.com/<your-github-account>/regexp-tree.git
cd regexp-tree
npm install
npm run build
./bin/regexp-tree --help
NOTE: You need to run build command every time you change the grammar file.
Usage as a CLI
Check the options available from CLI:
regexp-tree --help
Usage: regexp-tree [options]
Options:
-e, --expression A regular expression to be parsed
-l, --loc Whether to capture AST node locations
-o, --optimize Applies optimizer on the passed expression
-c, --compat Applies compat-transpiler on the passed expression
-t, --table Print NFA/DFA transition tables (nfa/dfa/all)
To parse a regular expression, pass -e option:
regexp-tree -e '/a|b/i'
Which produces an AST node corresponding to this regular expression:
{
type: 'RegExp',
body: {
type: 'Disjunction',
left: {
type: 'Char',
value: 'a',
symbol: 'a',
kind: 'simple',
codePoint: 97
},
right: {
type: 'Char',
value: 'b',
symbol: 'b',
kind: 'simple',
codePoint: 98
}
},
flags: 'i',
}
The --table option allows displaying NFA/DFA transition table:
./bin/regexp-tree -e '/ab/' --table all
Result:
> - starting
✓ - accepting
NFA transition table:
┌─────┬───┬─────────┐
│ │ a │ ε* │
├─────┼───┼─────────┤
│ 1 > │ │ {1,2,4} │
├─────┼───┼─────────┤
│ 2 │ 3 │ 2 │
├─────┼───┼─────────┤
│ 3 │ │ {3,4,2} │
├─────┼───┼─────────┤
│ 4 ✓ │ │ {4,2} │
└─────┴───┴─────────┘
DFA transition table:
┌───────┬───┐
│ │ a │
├───────┼───┤
│ 1 ✓ > │ 2 │
├───────┼───┤
│ 2 ✓ │ 2 │
└───────┴───┘
NOTE: the format of a regexp is / Body / OptionalFlags.
Usage from Node
The parser can also be used as a Node module:
const regexpTree = require('regexp-tree');
console.log(regexpTree.parse(/a|b/i));
Note, regexp-tree supports parsing regexes from strings, and also from actual RegExp objects (in general -- from any object which can be coerced to a string). If some feature is not implemented yet in an actual JavaScript RegExp, it should be passed as a string:
regexpTree.parse(/a|b/i);
regexpTree.parse('/./s');
Also note, that in string-mode, escaping is done using two slashes \\ per JavaScript:
regexpTree.parse(/\n/);
regexpTree.parse('/\\n/');
Capturing locations
For source code transformation tools it might be useful also to capture locations of the AST nodes. From the command line it's controlled via the -l option:
regexp-tree -e '/ab/' -l
This attaches loc object to each AST node:
{
type: 'RegExp',
body: {
type: 'Alternative',
expressions: [
{
type: 'Char',
value: 'a',
symbol: 'a',
kind: 'simple',
codePoint: 97,
loc: {
start: {
line: 1,
column: 1,
offset: 1,
},
end: {
line: 1,
column: 2,
offset: 2,
},
}
},
{
type: 'Char',
value: 'b',
symbol: 'b',
kind: 'simple',
codePoint: 98,
loc: {
start: {
line: 1,
column: 2,
offset: 2,
},
end: {
line: 1,
column: 3,
offset: 3,
},
}
}
],
loc: {
start: {
line: 1,
column: 1,
offset: 1,
},
end: {
line: 1,
column: 3,
offset: 3,
},
}
},
flags: '',
loc: {
start: {
line: 1,
column: 0,
offset: 0,
},
end: {
line: 1,
column: 4,
offset: 4,
},
}
}
From Node it's controlled via setOptions method exposed on the parser:
const regexpTree = require('regexp-tree');
const parsed = regexpTree
.parser
.setOptions({captureLocations: true})
.parse(/a|b/);
The setOptions method sets global options, which are preserved between calls. It is also possible to provide options per a single parse call, which might be more preferred:
const regexpTree = require('regexp-tree');
const parsed = regexpTree.parse(/a|b/, {
captureLocations: true,
});
Using traversal API
The traverse module allows handling needed AST nodes using visitor pattern. In Node the module is exposed as regexpTree.traverse method. Handlers receive an instance of NodePath class, which encapsulates node itself, its parent node, property, and index (in case if a node is a part of a collection).
Example:
const regexpTree = require('regexp-tree');
const ast = regexpTree.parse('/[a-z]{1,}/');
regexpTree.traverse(ast, {
Quantifier({node}) {
...
},
});
const re = regexpTree.generate(ast);
console.log(re);
Using transform API
NOTE: you can play with transformation APIs, and write actual transforms for quick tests in AST Explorer. See this example.
While traverse module provides basic traversal API, which can be used for any purposes of AST handling, transform module focuses mainly on transformation of regular expressions.
It accepts a regular expressions in different formats (string, an actual RegExp object, or an AST), applies a set of transformations, and retuns an instance of TransformResult. Handles receive as a parameter the same NodePath object used in traverse.
Example:
const regexpTree = require('regexp-tree');
const re = regexpTree.transform('/[a-z]{1,}/i', {
Quantifier(path) {
const {node} = path;
if (
node.type === 'Range' &&
node.from === 1 &&
!node.to
) {
path.replace({
type: 'Quantifier',
kind: '+',
greedy: node.greedy,
});
}
},
});
console.log(re.toString());
console.log(re.toRegExp());
console.log(re.getAST());
Transform plugins
A transformation plugin is a module which exports a transformation handler. We have seen above how we can pass a handler object directly to the regexpTree.transform method, here we extract it into a separate module, so it can be implemented and shared independently:
Example of a plugin:
module.exports = {
Char({node}) {
if (node.kind === 'simple' && node.value === 'a') {
node.value = 'b';
node.symbol = 'b';
node.codePoint = 98;
}
},
};
Once we have this plugin ready, we can require it, and pass to the transform function:
const regexpTree = require('regexp-tree');
const plugin = require('./regexp-tree-a-to-b-transform');
const re = regexpTree.transform(/(a|c)a+[a-z]/, plugin);
console.log(re.toRegExp());
NOTE: we can also pass a list of plugins to the regexpTree.transform. In this case the plugins are applied in one pass in order. Another approach is to run several sequential calls to transform, setting up a pipeline, when a transformed AST is passed further to another plugin, etc.
You can see other examples of transform plugins in the optimizer/transforms or in the compat-transpiler/transforms directories.
Using generator API
The generator module generates regular expressions from corresponding AST nodes. In Node the module is exposed as regexpTree.generate method.
Example:
const regexpTree = require('regexp-tree');
const re = regexpTree.generate({
type: 'RegExp',
body: {
type: 'Char',
value: 'a',
symbol: 'a',
kind: 'simple',
codePoint: 97
},
flags: 'i',
});
console.log(re);
Using optimizer API
Optimizer transforms your regexp into an optimized version, replacing some sub-expressions with their idiomatic patterns. This might be good for different kinds of minifiers, as well as for regexp machines.
NOTE: the Optimizer is implemented as a set of regexp-tree plugins.
Example:
const regexpTree = require('regexp-tree');
const originalRe = /[a-zA-Z_0-9][A-Z_\da-z]*\e{1,}/;
const optimizedRe = regexpTree
.optimize(originalRe)
.toRegExp();
console.log(optimizedRe);
From CLI the optimizer is available via --optimize (-o) option:
regexp-tree -e '/[a-zA-Z_0-9][A-Z_\da-z]*\e{1,}/' -o
Result:
Optimized: /\w+e+/
Optimizer ESLint plugin
The optimizer module is also available as an ESLint plugin, which can be installed at: eslint-plugin-optimize-regex.
Using compat-transpiler API
The compat-transpiler module translates your regexp in new format or in new syntax, into an equivalent regexp in a legacy representation, so it can be used in engines which don't yet implement the new syntax.
NOTE: the compat-transpiler is implemented as a set of regexp-tree plugins.
Example, "dotAll" s flag:
/./s
Is translated into:
/[\0-\uFFFF]/
Or named capturing groups:
/(?<value>a)\k<value>\1/
Becomes:
/(a)\1\1/
To use the API from Node:
const regexpTree = require('regexp-tree');
const originalRe = '/(?<all>.)\\k<all>/s';
const compatTranspiledRe = regexpTree
.compatTranspile(originalRe)
.toRegExp();
console.log(compatTranspiledRe);
From CLI the compat-transpiler is available via --compat (-c) option:
regexp-tree -e '/(?<all>.)\k<all>/s' -c
Result:
Compat: /([\0-\uFFFF])\1/
Compat-transpiler Babel plugin
The compat-transpiler module is also available as a Babel plugin, which can be installed at: babel-plugin-transform-modern-regexp.
Note, the plugin also includes extended regexp features.
RegExp extensions
Besides future proposals, like named capturing group, and other which are being currently standardized, regexp-tree also supports non-standard features.
NOTE: "non-standard" means specifically ECMAScript standard, since in other regexp egnines, e.g. PCRE, Python, etc. these features are standard.
One of such featurs is x flag, which enables extended mode of regular expressions. In this mode most of whitespaces are ignored, and expressions can use #-comments.
Example:
/
# A regular expression for date.
(?<year>\d{4})- # year part of a date
(?<month>\d{2})- # month part of a date
(?<day>\d{2}) # day part of a date
/x
This is normally parsed by the regexp-tree parser, and compat-transpiler has full support for it; it's translated into:
/(\d{4})-(\d{2})-(\d{2})/
RegExp extensions Babel plugin
The regexp extensions are also available as a Babel plugin, which can be installed at: babel-plugin-transform-modern-regexp.
Note, the plugin also includes compat-transpiler features.
Creating RegExp objects
To create an actual RegExp JavaScript object, we can use regexpTree.toRegExp method:
const regexpTree = require('regexp-tree');
const re = regexpTree.toRegExp('/[a-z]/i');
console.log(
re.test('a'),
re.test('Z'),
);
Executing regexes
It is also possible to execute regular expressions using exec API method, which has support for new syntax, and features, such as named capturing group, etc:
const regexpTree = require('regexp-tree');
const re = `/
# A regular expression for date.
(?<year>\\d{4})- # year part of a date
(?<month>\\d{2})- # month part of a date
(?<day>\\d{2}) # day part of a date
/x`;
const string = '2017-04-14';
const result = regexpTree.exec(re, string);
console.log(result.groups);
Using interpreter API
NOTE: you can read more about implementation details of the interpreter in this series of articles.
In addition to executing regular expressions using JavaScript built-in RegExp engine, RegExp Tree also implements own interpreter based on classic NFA/DFA finite automaton engine.
Currently it aims educational purposes -- to trace the regexp matching process, transitioning in NFA/DFA states. It also allows building state transitioning table, which can be used for custom implementation. In API the module is exposed as fa (finite-automaton) object.
Example:
const {fa} = require('regexp-tree');
const re = /ab|c*/;
console.log(fa.test(re, 'ab'));
console.log(fa.test(re, ''));
console.log(fa.test(re, 'c'));
const nfa = fa.toNFA(re);
console.log(nfa.getTransitionTable());
const dfa = fa.toDFA(re);
console.log(dfa.getTransitionTable());
For more granular work with NFA and DFA, fa module also exposes convenient builders, so you can build NFA fragments directly:
const {fa} = require('regexp-tree');
const {
alt,
char,
or,
rep,
} = fa.builders;
const re = or(
alt(char('a'), char('b')),
rep(char('c'))
);
console.log(re.matches('ab'));
console.log(re.matches(''));
console.log(re.matches('c'));
const {DFA} = fa;
const reDFA = new DFA(re);
console.log(reDFA.matches('ab'));
console.log(reDFA.matches(''));
console.log(reDFA.matches('c'));
AST nodes specification
Below are the AST node types for different regular expressions patterns:
Char
A basic building block, single character. Can be escaped, and be of different kinds.
Simple char
Basic non-escaped char in a regexp:
z
Node:
{
type: 'Char',
value: 'z',
symbol: 'z',
kind: 'simple',
codePoint: 122
}
NOTE: to test this from CLI, the char should be in an actual regexp -- /z/.
Escaped char
\z
The same value, escaped flag is added:
{
type: 'Char',
value: 'z',
symbol: 'z',
kind: 'simple',
codePoint: 122,
escaped: true
}
Escaping is mostly used with meta symbols:
// Syntax error
*
\*
OK, node:
{
type: 'Char',
value: '*',
symbol: '*',
kind: 'simple',
codePoint: 42,
escaped: true
}
Meta char
A meta character should not be confused with an escaped char.
Example:
\n
Node:
{
type: 'Char',
value: '\\n',
symbol: '\n',
kind: 'meta',
codePoint: 10
}
Among other meta character are: ., \f, \r, \n, \t, \v, \0, [\b] (backspace char), \s, \S, \w, \W, \d, \D.
NOTE: Meta characters representing ranges (like ., \s, etc.) have undefined value for symbol and NaN for codePoint.
NOTE: \b and \B are parsed as Assertion node type, not Char.
Control char
A char preceded with \c, e.g. \cx, which stands for CTRL+x:
\cx
Node:
{
type: 'Char',
value: '\\cx',
symbol: undefined,
kind: 'control',
codePoint: NaN
}
HEX char-code
A char preceded with \x, followed by a HEX-code, e.g. \x3B (symbol ;):
\x3B
Node:
{
type: 'Char',
value: '\\x3B',
symbol: ';',
kind: 'hex',
codePoint: 59
}
Decimal char-code
Char-code:
\42
Node:
{
type: 'Char',
value: '\\42',
symbol: '*',
kind: 'decimal',
codePoint: 42
}
Octal char-code
Char-code started with \0, followed by an octal number:
\073
Node:
{
type: 'Char',
value: '\\073',
symbol: ';',
kind: 'oct',
codePoint: 59
}
Unicode
Unicode char started with \u, followed by a hex number:
\u003B
Node:
{
type: 'Char',
value: '\\u003B',
symbol: ';',
kind: 'unicode',
codePoint: 59
}
When using the u flag, unicode chars can also be represented using \u followed by a hex number between curly braces:
\u{1F680}
Node:
{
type: 'Char',
value: '\\u{1F680}',
symbol: '🚀',
kind: 'unicode',
codePoint: 128640
}
When using the u flag, unicode chars can also be represented using a surrogate pair:
\ud83d\ude80
Node:
{
type: 'Char',
value: '\\ud83d\\ude80',
symbol: '🚀',
kind: 'unicode',
codePoint: 128640,
isSurrogatePair: true
}
Character class
Character classes define a set of characters. A set may include as simple characters, as well as character ranges. A class can be positive (any from the characters in the class match), or negative (any but the characters from the class match).
Positive character class
A positive character class is defined between [ and ] brackets:
[a*]
A node:
{
type: 'CharacterClass',
expressions: [
{
type: 'Char',
value: 'a',
symbol: 'a',
kind: 'simple',
codePoint: 97
},
{
type: 'Char',
value: '*',
symbol: '*',
kind: 'simple',
codePoint: 42
}
]
}
NOTE: some meta symbols are treated as normal characters in a character class. E.g. * is not a repetition quantifier, but a simple char.
Negative character class
A negative character class is defined between [^ and ] brackets:
[^ab]
An AST node is the same, just negative property is added:
{
type: 'CharacterClass',
negative: true,
expressions: [
{
type: 'Char',
value: 'a',
symbol: 'a',
kind: 'simple',
codePoint: 97
},
{
type: 'Char',
value: 'b',
symbol: 'b',
kind: 'simple',
codePoint: 98
}
]
}
Character class ranges
As mentioned, a character class may also contain ranges of symbols:
[a-z]
A node:
{
type: 'CharacterClass',
expressions: [
{
type: 'ClassRange',
from: {
type: 'Char',
value: 'a',
symbol: 'a',
kind: 'simple',
codePoint: 97
},
to: {
type: 'Char',
value: 'z',
symbol: 'z',
kind: 'simple',
codePoint: 122
}
}
]
}
NOTE: it is a syntax error if to value is less than from value: /[z-a]/.
The range value can be the same for from and to, and the special range - character is treated as a simple character when it stands in a char position:
// from: 'a', to: 'a'
[a-a]
// from: '-', to: '-'
[---]
// simple '-' char:
[-]
// 3 ranges:
[a-zA-Z0-9]+
Alternative
An alternative (or concatenation) defines a chain of patterns followed one after another:
abc
A node:
{
type: 'Alternative',
expressions: [
{
type: 'Char',
value: 'a',
symbol: 'a',
kind: 'simple',
codePoint: 97
},
{
type: 'Char',
value: 'b',
symbol: 'b',
kind: 'simple',
codePoint: 98
},
{
type: 'Char',
value: 'c',
symbol: 'c',
kind: 'simple',
codePoint: 99
}
]
}
Another examples:
// 'a' with a quantifier, followed by 'b'
a?b
// A group followed by a class:
(ab)[a-z]
Disjunction
The disjunction defines "OR" operation for regexp patterns. It's a binary operation, having left, and right nodes.
Matches a or b:
a|b
A node:
{
type: 'Disjunction',
left: {
type: 'Char',
value: 'a',
symbol: 'a',
kind: 'simple',
codePoint: 97
},
right: {
type: 'Char',
value: 'b',
symbol: 'b',
kind: 'simple',
codePoint: 98
}
}
Groups
The groups play two roles: they define grouping precedence, and allow to capture needed sub-expressions in case of a capturing group.
Capturing group
"Capturing" means the matched string can be referred later by a user, including in the pattern itself -- by using backreferences.
Char a, and b are grouped, followed by the c char:
(ab)c
A node:
{
type: 'Alternative',
expressions: [
{
type: 'Group',
capturing: true,
number: 1,
expression: {
type: 'Alternative',
expressions: [
{
type: 'Char',
value: 'a',
symbol: 'a',
kind: 'simple',
codePoint: 97
},
{
type: 'Char',
value: 'b',
symbol: 'b',
kind: 'simple',
codePoint: 98
}
]
}
},
{
type: 'Char',
value: 'c',
symbol: 'c',
kind: 'simple',
codePoint: 99
}
]
}
As we can see, it also tracks the number of the group.
Another example:
// A grouped disjunction of a symbol, and a character class:
(5|[a-z])
Named capturing group
NOTE: Named capturing groups are not yet supported by JavaScript RegExp. It is an ECMAScript proposal which is at stage 3 at the moment.
A capturing group can be given a name using the (?<name>...) syntax, for any identifier name.
For example, a regular expressions for a date:
/(?<year>\d{4})-(?<month>\d{2})-(?<day>\d{2})/u
For the group:
(?<foo>x)
We have the following node (the name property with value foo is added):
{
type: 'Group',
capturing: true,
name: 'foo',
number: 1,
expression: {
type: 'Char',
value: 'x',
symbol: 'x',
kind: 'simple',
codePoint: 120
}
}
Non-capturing group
Sometimes we don't need to actually capture the matched string from a group. In this case we can use a non-capturing group:
Char a, and b are grouped, but not captured, followed by the c char:
(?:ab)c
The same node, the capturing flag is false:
{
type: 'Alternative',
expressions: [
{
type: 'Group',
capturing: false,
expression: {
type: 'Alternative',
expressions: [
{
type: 'Char',
value: 'a',
symbol: 'a',
kind: 'simple',
codePoint: 97
},
{
type: 'Char',
value: 'b',
symbol: 'b',
kind: 'simple',
codePoint: 98
}
]
}
},
{
type: 'Char',
value: 'c',
symbol: 'c',
kind: 'simple',
codePoint: 99
}
]
}
Backreferences
A capturing group can be referenced in the pattern using notation of an escaped group number.
Matches abab string:
(ab)\1
A node:
{
type: 'Alternative',
expressions: [
{
type: 'Group',
capturing: true,
number: 1,
expression: {
type: 'Alternative',
expressions: [
{
type: 'Char',
value: 'a',
symbol: 'a',
kind: 'simple',
codePoint: 97
},
{
type: 'Char',
value: 'b',
symbol: 'b',
kind: 'simple',
codePoint: 98
}
]
}
},
{
type: 'Backreference',
kind: 'number',
number: 1,
reference: 1,
}
]
}
A named capturing group can be accessed using \k<name> pattern, and also using a numbered reference.
Matches www:
(?<foo>w)\k<foo>\1
A node:
{
type: 'Alternative',
expressions: [
{
type: 'Group',
capturing: true,
name: 'foo',
number: 1,
expression: {
type: 'Char',
value: 'w',
symbol: 'w',
kind: 'simple',
codePoint: 119
}
},
{
type: 'Backreference',
kind: 'name',
number: 1,
reference: 'foo'
},
{
type: 'Backreference',
kind: 'number',
number: 1,
reference: 1
}
]
}
Quantifiers
Quantifiers specify repetition of a regular expression (or of its part). Below are the quantifiers which wrap a parsed expression into a Repetition node. The quantifier itself can be of different kinds, and has Quantifier node type.
? zero-or-one
The ? quantifier is short for {0,1}.
a?
Node:
{
type: 'Repetition',
expression: {
type: 'Char',
value: 'a',
symbol: 'a',
kind: 'simple',
codePoint: 97
},
quantifier: {
type: 'Quantifier',
kind: '?',
greedy: true
}
}
* zero-or-more
The * quantifier is short for {0,}.
a*
Node:
{
type: 'Repetition',
expression: {
type: 'Char',
value: 'a',
symbol: 'a',
kind: 'simple',
codePoint: 97
},
quantifier: {
type: 'Quantifier',
kind: '*',
greedy: true
}
}
+ one-or-more
The + quantifier is short for {1,}.
// Same as `aa*`, or `a{1,}`
a+
Node:
{
type: 'Repetition',
expression: {
type: 'Char',
value: 'a',
symbol: 'a',
kind: 'simple',
codePoint: 97
},
quantifier: {
type: 'Quantifier',
kind: '+',
greedy: true
}
}
Range-based quantifiers
Explicit range-based quantifiers are parsed as follows:
Exact number of matches
a{3}
The type of the quantifier is Range, and from, and to properties have the same value:
{
type: 'Repetition',
expression: {
type: 'Char',
value: 'a',
symbol: 'a',
kind: 'simple',
codePoint: 97
},
quantifier: {
type: 'Quantifier',
kind: 'Range',
from: 3,
to: 3,
greedy: true
}
}
Open range
An open range doesn't have max value (assuming semantic "more", or Infinity value):
a{3,}
An AST node for such range doesn't contain to property:
{
type: 'Repetition',
expression: {
type: 'Char',
value: 'a',
symbol: 'a',
kind: 'simple',
codePoint: 97
},
quantifier: {
type: 'Quantifier',
kind: 'Range',
from: 3,
greedy: true
}
}
Closed range
A closed range has explicit max value: (which syntactically can be the same as min value):
a{3,5}
// Same as a{3}
a{3,3}
An AST node for a closed range:
{
type: 'Repetition',
expression: {
type: 'Char',
value: 'a',
symbol: 'a',
kind: 'simple',
codePoint: 97
},
quantifier: {
type: 'Quantifier',
kind: 'Range',
from: 3,
to: 5,
greedy: true
}
}
NOTE: it is a syntax error if the max value is less than min value: /a{3,2}/
Non-greedy
If any quantifier is followed by the ?, the quantifier becomes non-greedy.
Example:
a+?
Node:
{
type: 'Repetition',
expression: {
type: 'Char',
value: 'a',
symbol: 'a',
kind: 'simple',
codePoint: 97
},
quantifier: {
type: 'Quantifier',
kind: '+',
greedy: false
}
}
Other examples:
a??
a*?
a{1}?
a{1,}?
a{1,3}?
Assertions
Assertions appear as separate AST nodes, however instread of manipulating on the characters themselves, they assert certain conditions of a matching string. Examples: ^ -- beginning of a string (or a line in multiline mode), $ -- end of a string, etc.
^ begin marker
The ^ assertion checks whether a scanner is at the beginning of a string (or a line in multiline mode).
In the example below ^ is not a property of the a symbol, but a separate AST node for the assertion. The parsed node is actually an Alternative with two nodes:
^a
The node:
{
type: 'Alternative',
expressions: [
{
type: 'Assertion',
kind: '^'
},
{
type: 'Char',
value: 'a',
symbol: 'a',
kind: 'simple',
codePoint: 97
}
]
}
Since assertion is a separate node, it may appear anywhere in the matching string. The following regexp is completely valid, and asserts beginning of the string; it'll match an empty string:
^^^^^
$ end marker
The $ assertion is similar to ^, but asserts the end of a string (or a line in a multiline mode):
a$
A node:
{
type: 'Alternative',
expressions: [
{
type: 'Char',
value: 'a',
symbol: 'a',
kind: 'simple',
codePoint: 97
},
{
type: 'Assertion',
kind: '$'
}
]
}
And again, this is a completely valid regexp, and matches an empty string:
^^^^$$$$$
// valid too:
$^
Boundary assertions
The \b assertion check for word boundary, i.e. the position between a word and a space.
Matches x in x y, but not in xy:
x\b
A node:
{
type: 'Alternative',
expressions: [
{
type: 'Char',
value: 'x',
symbol: 'x',
kind: 'simple',
codePoint: 120
},
{
type: 'Assertion',
kind: '\\b'
}
]
}
The \B is vice-versa checks for non-word boundary. The following example matches x in xy, but not in x y:
x\B
A node is the same:
{
type: 'Alternative',
expressions: [
{
type: 'Char',
value: 'x',
symbol: 'x',
kind: 'simple',
codePoint: 120
},
{
type: 'Assertion',
kind: '\\B'
}
]
}
Lookahead assertions
These assertions check whether a pattern is followed (or not followed for the negative assertion) by another pattern.
Positive lookahead assertion
Matches a only if it's followed by b:
a(?=b)
A node:
{
type: 'Alternative',
expressions: [
{
type: 'Char',
value: 'a',
symbol: 'a',
kind: 'simple',
codePoint: 97
},
{
type: 'Assertion',
kind: 'Lookahead',
assertion: {
type: 'Char',
value: 'b',
symbol: 'b',
kind: 'simple',
codePoint: 98
}
}
]
}
Negative lookahead assertion
Matches a only if it's not followed by b:
a(?!b)
A node is similar, just negative flag is added:
{
type: 'Alternative',
expressions: [
{
type: 'Char',
value: 'a',
symbol: 'a',
kind: 'simple',
codePoint: 97
},
{
type: 'Assertion',
kind: 'Lookahead',
negative: true,
assertion: {
type: 'Char',
value: 'b',
symbol: 'b',
kind: 'simple',
codePoint: 98
}
}
]
}
Lookbehind assertions
NOTE: Lookbehind assertions are not yet supported by JavaScript RegExp. It is an ECMAScript proposal which is at stage 3 at the moment.
These assertions check whether a pattern is preceded (or not preceded for the negative assertion) by another pattern.
Positive lookbehind assertion
Matches b only if it's preceded by a:
(?<=a)b
A node:
{
type: 'Alternative',
expressions: [
{
type: 'Assertion',
kind: 'Lookbehind',
assertion: {
type: 'Char',
value: 'a',
symbol: 'a',
kind: 'simple',
codePoint: 97
}
},
{
type: 'Char',
value: 'b',
symbol: 'b',
kind: 'simple',
codePoint: 98
},
]
}
Negative lookbehind assertion
Matches b only if it's not preceded by a:
(?<!a)b
A node:
{
type: 'Alternative',
expressions: [
{
type: 'Assertion',
kind: 'Lookbehind',
negative: true,
assertion: {
type: 'Char',
value: 'a',
symbol: 'a',
kind: 'simple',
codePoint: 97
}
},
{
type: 'Char',
value: 'b',
symbol: 'b',
kind: 'simple',
codePoint: 98
},
]
}