node-re2
This project provides Node.js bindings for RE2,
a fast, safe alternative to backtracking regular expression engines, written by Russ Cox.
To learn more about RE2, start with the overview
"Regular Expression Matching in the Wild". More resources can be found
on his "Implementing Regular Expressions" page.
RE2's regular expression language is almost a superset of what is provided by RegExp (see Syntax),
but it lacks two features: backreferences and lookahead assertions. See below for more details.
The RE2 object emulates the standard RegExp, making it a practical drop-in replacement in most cases.
RE2 is extended to provide String-based regular expression methods as well. To help convert
RegExp objects to RE2, its constructor can take a RegExp directly, honoring all of its properties.
It can work with Node.js buffers directly, reducing the overhead
of recoding and copying characters, and making the processing and parsing of long files fast.
Why use node-re2?
The built-in Node.js regular expression engine can run in exponential time with a special combination:
- A vulnerable regular expression
- "Evil input"
This can lead to what is known as a Regular Expression Denial of Service (ReDoS).
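The combination above can be reproduced with the built-in engine alone. The sketch below is illustrative: the pattern `^(a+)+$`, the input sizes, and the helper names `evilInput` and `timeMatch` are assumptions chosen for the demonstration, not part of node-re2. On a backtracking engine the measured time should grow roughly exponentially with the input length, so the sizes are kept small.

```javascript
// A vulnerable regular expression: nested quantifiers over the same character.
const vulnerable = /^(a+)+$/;

// "Evil input": a run of "a" followed by a character that forces the match to fail.
function evilInput(n) {
  return "a".repeat(n) + "!";
}

// Time a single test() call in milliseconds (hypothetical helper).
function timeMatch(re, input) {
  const start = process.hrtime.bigint();
  const matched = re.test(input);
  const elapsedMs = Number(process.hrtime.bigint() - start) / 1e6;
  return { matched, elapsedMs };
}

const timings = [];
for (const n of [14, 16, 18, 20]) {
  const { matched, elapsedMs } = timeMatch(vulnerable, evilInput(n));
  timings.push({ n, matched, elapsedMs });
  console.log("n=" + n + " matched=" + matched + " time=" + elapsedMs.toFixed(2) + "ms");
}
```

Every call fails to match, but each extra pair of characters makes the failure noticeably more expensive to establish.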
To tell if your regular expressions are vulnerable, you might try one of these projects:
However, neither project is perfect.
node-re2 can protect your Node.js application from ReDoS.
node-re2 makes vulnerable regular expression patterns safe by evaluating them in RE2
instead of the built-in Node.js regex engine.
Standard features
An RE2 object can be created just like a RegExp:
Supported properties:
Supported methods:
Extensions
Shortcut construction
An RE2 object can be created from a regular expression:
var re1 = new RE2(/ab*/ig);
var re2 = new RE2(re1);
String methods
Standard String defines four more methods that can use regular expressions. RE2 provides them as methods,
exchanging the positions of the string and the regular expression:
re2.match(str)
re2.replace(str, newSubStr|function)
re2.search(str)
re2.split(str[, limit])
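The argument swap can be illustrated with the built-in RegExp alone, which exposes the same four operations through well-known symbols. This is a sketch for comparison only, not RE2's implementation; with node-re2 the calls would be the re2.match(str), re2.replace(str, ...), re2.search(str), and re2.split(str, ...) forms listed above. The sample string and pattern are arbitrary.

```javascript
const str = "abbcbd";

// Usual form: the string is the receiver, the regular expression is an argument.
const viaString = {
  match: str.match(/b+/g),
  replace: str.replace(/b+/g, "X"),
  search: str.search(/b+/),
  split: str.split(/b+/, 2),
};

// Swapped form: the regular expression is the receiver, the string is an argument.
const viaRegExp = {
  match: /b+/g[Symbol.match](str),
  replace: /b+/g[Symbol.replace](str, "X"),
  search: /b+/[Symbol.search](str),
  split: /b+/[Symbol.split](str, 2),
};

console.log(viaString);
console.log(viaRegExp);
```

Both objects hold the same results, which is the equivalence the swapped methods rely on.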
Buffer support
In order to support Buffer directly, most methods can accept buffers instead of strings.
This avoids converting between strings and buffers, which speeds up all operations.
The following signatures are supported:
re2.exec(buf)
re2.test(buf)
re2.match(buf)
re2.search(buf)
re2.split(buf[, limit])
re2.replace(buf, replacer)
Differences with their string-based versions:
- All buffers are assumed to be encoded as UTF-8
(ASCII is a proper subset of UTF-8).
- Instead of strings they return Buffer objects, even in composite objects.
A buffer can be converted to a string with buf.toString().
- All offsets and lengths are in bytes, rather than characters (each UTF-8 character can occupy from 1 to 4 bytes).
This way users can properly slice buffers without costly recalculations from characters to bytes.
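The difference between character offsets and byte offsets can be seen with plain Node.js strings and buffers. The sketch below does not call node-re2; it only shows why a byte offset is the one needed to slice a UTF-8 buffer. The sample text is an assumption picked to match the Cyrillic example used elsewhere in this document.

```javascript
// Three Cyrillic letters; each one occupies 2 bytes in UTF-8.
const text = "абв";
const buf = Buffer.from(text, "utf8");

// Offset of the second letter, counted in characters and in bytes.
const charOffset = text.indexOf("б");
const byteOffset = buf.indexOf("б");

// A byte offset can be used to slice the buffer directly.
const letterBytes = Buffer.byteLength("б", "utf8");
const slice = buf.subarray(byteOffset, byteOffset + letterBytes);

console.log(charOffset, byteOffset, slice.toString("utf8"));
```

Using the character offset to slice the buffer would cut into the middle of the first letter.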
When re2.replace() is used with a replacer function, the replacer can return a buffer or a string. But all arguments
(except for the input object) will be strings, and the offset will be in characters. If you prefer to deal
with buffers and byte offsets in a replacer function, set the property useBuffers to true on the function:
function strReplacer(match, offset, input) {
  // typeof match == "string"
  return "<= " + offset + " characters|";
}
RE2("б").replace("абв", strReplacer);
// "а<= 1 characters|в"

function bufReplacer(match, offset, input) {
  // match is a Buffer, offset is in bytes
  return "<= " + offset + " bytes|";
}
bufReplacer.useBuffers = true;
RE2("б").replace("абв", bufReplacer);
// "а<= 2 bytes|в"
This feature works for string and buffer inputs. If a buffer was used as an input, its output will be returned as
a buffer too, otherwise a string will be returned.
Calculate length
Two functions to calculate string sizes between UTF-8 and UTF-16 are exposed on RE2:
- RE2.getUtf8Length(str): calculates the buffer size in bytes needed to encode a UTF-16 string as a UTF-8 buffer.
- RE2.getUtf16Length(buf): calculates the string size in characters needed to encode a UTF-8 buffer as a UTF-16 string.
JavaScript supports UCS-2 strings with 16-bit characters, while node.js 0.11 supports full UTF-16 as
a default string.
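The two sizes can be checked against Node.js built-ins. The sketch below does not call RE2.getUtf8Length() or RE2.getUtf16Length(); it assumes, based on the descriptions above, that they would report the same numbers as Buffer.byteLength() and the length of the decoded string. The sample string is an assumption chosen to include 1-, 2-, 3-, and 4-byte UTF-8 characters.

```javascript
// "a" (1 byte), "б" (2 bytes), "€" (3 bytes), U+1F600 (4 bytes, a surrogate pair in UTF-16).
const str = "aб€\u{1F600}";

// Bytes needed to encode the UTF-16 string as UTF-8: 1 + 2 + 3 + 4.
const utf8Length = Buffer.byteLength(str, "utf8");

// UTF-16 code units needed to decode the UTF-8 buffer back: 1 + 1 + 1 + 2.
const buf = Buffer.from(str, "utf8");
const utf16Length = buf.toString("utf8").length;

console.log(utf8Length, utf16Length);
```

The built-in route decodes the whole buffer to measure it, which is presumably the work a dedicated length function avoids.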
How to install
Installation:
npm install re2
How to use
It is used just like a RegExp object.
var RE2 = require("re2");

// with default flags
var re = new RE2("a(b*)");
var result = re.exec("abbc");
console.log(result[0]); // "abb"
console.log(result[1]); // "bb"

result = re.exec("aBbC");
console.log(result[0]); // "a"
console.log(result[1]); // ""

// with explicit flags
re = new RE2("a(b*)", "i");
result = re.exec("aBbC");
console.log(result[0]); // "aBb"
console.log(result[1]); // "Bb"

// from a regular expression object
var regexp = new RegExp("a(b*)", "i");
re = new RE2(regexp);
result = re.exec("aBbC");
console.log(result[0]); // "aBb"
console.log(result[1]); // "Bb"

// from a regular expression literal
re = new RE2(/a(b*)/i);
result = re.exec("aBbC");
console.log(result[0]); // "aBb"
console.log(result[1]); // "Bb"

// from another RE2 object
var rex = new RE2(re);
result = rex.exec("aBbC");
console.log(result[0]); // "aBb"
console.log(result[1]); // "Bb"

// shortcut
result = new RE2("ab*").exec("abba");

// factory (without new)
result = RE2("ab*").exec("abba");
Limitations (Things RE2 does not support)
RE2 consciously avoids any regular expression features that require worst-case exponential time to evaluate.
These features are essentially those that describe a Context-Free Language (CFL) rather than a Regular Expression,
and are extensions to the traditional regular expression language because some people don't know when enough is enough.
The most noteworthy missing features are backreferences and lookahead assertions.
If your application uses these features, you should continue to use RegExp.
But since these features are fundamentally vulnerable to ReDoS,
you should strongly consider replacing them.
RE2 will throw a SyntaxError if you try to declare a regular expression using these features.
If you are evaluating an externally-provided regular expression, wrap your RE2
declarations in a try-catch block. This lets you fall back to RegExp when RE2 lacks a feature:
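var re = /(a)+(b)*/;
try {
  re = new RE2(re);
  // RE2 accepted the pattern: use it as a drop-in replacement
} catch (e) {
  // RE2 rejected the pattern: suppress the error and keep the original RegExp
}
var result = re.exec(sample);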
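The same fallback can be wrapped in a small helper that also reports which engine ended up being used, which is handy for logging patterns that still run on the backtracking engine. This is a sketch under stated assumptions: compileWithFallback is a hypothetical name, and StrictEngine is a stand-in that only imitates RE2 rejecting backreferences and lookahead with a SyntaxError, so that the example runs without the native module. With node-re2 installed you would pass require("re2") instead.

```javascript
// Hypothetical helper: try the preferred engine first, fall back to RegExp on SyntaxError.
function compileWithFallback(Preferred, source, flags) {
  try {
    return { regex: new Preferred(source, flags), engine: "preferred" };
  } catch (e) {
    if (!(e instanceof SyntaxError)) throw e; // unrelated errors are not swallowed
    return { regex: new RegExp(source, flags), engine: "builtin" };
  }
}

// Stand-in for RE2 (an assumption for this sketch): rejects \1..\9 and (?= / (?!.
function StrictEngine(source, flags) {
  if (/\\[1-9]|\(\?[=!]/.test(source)) {
    throw new SyntaxError("unsupported feature in: " + source);
  }
  return new RegExp(source, flags);
}

const plain = compileWithFallback(StrictEngine, "a(b*)", "i");
const backref = compileWithFallback(StrictEngine, "(cat|dog)\\1");

console.log(plain.engine, plain.regex.test("aBbC"));
console.log(backref.engine, backref.regex.test("catcat"));
```

Patterns that land on the "builtin" branch keep their original ReDoS exposure, so they are the ones worth reviewing.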
In addition to these missing features, RE2 also behaves somewhat differently from the built-in
regular expression engine in corner cases.
Backreferences
RE2 doesn't support backreferences, which are numbered references to previously
matched groups, like so: \1, \2, and so on. Examples of backreferences:
/(cat|dog)\1/.test("catcat"); // true
/(cat|dog)\1/.test("dogdog"); // true
/(cat|dog)\1/.test("catdog"); // false
/(cat|dog)\1/.test("dogcat"); // false
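When the referenced group can only match a small, fixed set of alternatives, one way to drop the backreference is to spell the repetitions out. The sketch below uses the built-in RegExp to compare the two forms on the inputs above; the rewritten pattern uses only alternation, so it avoids the feature that RE2 rejects. This rewrite is an illustration, not a general technique: it does not apply when the group can match arbitrary text.

```javascript
// Original pattern with a backreference, and a rewrite that enumerates the repeats.
const withBackref = /(cat|dog)\1/;
const withoutBackref = /catcat|dogdog/;

const samples = ["catcat", "dogdog", "catdog", "dogcat"];

// Compare both patterns on every sample.
const results = samples.map(function (input) {
  return {
    input: input,
    backref: withBackref.test(input),
    rewritten: withoutBackref.test(input),
  };
});

console.log(results);
```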
Lookahead assertions
RE2 doesn't support lookahead assertions, which allow a match to depend on the content that follows it.
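/abc(?=def)/; // positive lookahead: "abc" only if followed by "def"
/abc(?!def)/; // negative lookahead: "abc" only if not followed by "def"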
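Simple lookaheads like the two above can often be replaced by consuming the following text and inspecting capture groups. The sketch below is a hedged illustration using the built-in RegExp; the helper names are hypothetical, and the patterns inside them use only groups and an optional quantifier. Unlike a true lookahead, the rewritten patterns consume "def", which matters when matches are iterated or replaced.

```javascript
// Emulates /abc(?=def)/: capture the wanted part and consume the required suffix.
function matchAbcBeforeDef(input) {
  const m = /(abc)def/.exec(input);
  return m ? { match: m[1], index: m.index } : null;
}

// Emulates /abc(?!def)/: optionally capture "def" and skip matches that have it.
function matchAbcNotBeforeDef(input) {
  const re = /abc(def)?/g;
  let m;
  while ((m = re.exec(input)) !== null) {
    if (m[1] === undefined) return { match: "abc", index: m.index };
  }
  return null;
}

console.log(matchAbcBeforeDef("xabcdef"));
console.log(matchAbcNotBeforeDef("abcdef abcx"));
console.log(matchAbcNotBeforeDef("abcdef"));
```

The first-match positions agree with the lookahead versions on these inputs; more complex assertions may need a different approach.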
Mismatched behavior
RE2 and the built-in regex engine disagree in a few cases.
Before you switch to RE2, verify that your regular expressions continue to work as expected.
They should do so in the vast majority of cases.
Here is an example of a case where they may not:
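var RE2 = require("../re2");
var pattern = '(?:(a)|(b)|(c))+';
var built_in = new RegExp(pattern);
var re2 = new RE2(pattern);
var input = 'abc';
var bi_res = built_in.exec(input);
var re2_res = re2.exec(input);
// the built-in engine resets the captures of the repeated group on each iteration
console.log('bi_res: ' + bi_res); // prints: bi_res: abc,,,c
// RE2 keeps the last value captured by each group
console.log('re2_res : ' + re2_res); // prints: re2_res : abc,a,b,c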
Working on this project
This project uses git submodules, so the correct way to get it is:
git clone git@github.com:uhop/node-re2.git
cd node-re2
git submodule update --init --recursive
In order to build it, make sure that you have all the necessary gyp dependencies
for your platform, then run:
npm install
Or:
yarn
Release history
- 1.5.0 Bug fixes, error checks, better docs. Thx Jamie Davis, and omg!
- 1.4.1 Minor corrections in README.
- 1.4.0 Use re2 as a git submodule. Thx Ben James!
- 1.3.3 Refreshed dependencies.
- 1.3.2 Updated references in README (re2 was moved to github).
- 1.3.1 Refreshed dependencies, new Travis-CI config.
- 1.3.0 Upgraded NAN to 1.6.3, now we support node.js 0.10.36, 0.12.0, and io.js 1.3.0. Thx @reid!
- 1.2.0 Documented getUtfXLength() functions. Added support for \c and \u commands.
- 1.1.1 Minor corrections in README.
- 1.1.0 Buffer-based API is public. Unicode is fully supported.
- 1.0.0 Implemented all RegExp methods, and all relevant String methods.
- 0.9.0 The initial public release.
License
BSD