The JavaScript `RegExp` class has a problem. Well, it has a few, but we're focusing on just one specific problem in this post. At Socket, we are currently experiencing a significant issue with `RegExp` that we would love to help fix.
The Socket code analysis engine spends a lot of time scanning files, and we try to do it quickly so that our customers get timely results. Regular expressions are often fast and expressive, but there is a dark side to their design in JS: in particular, it is hard to observe matching progress, or the lack of it.
Consider an overly naive check for `eval` in some file:

```javascript
const eval_pattern = /\beval\b/g
// file contains: 'not_eval_at_all()'
for await (const chunk of file) {
  eval_pattern.exec(chunk)
}
```
This actually has a bug and a few problems.

- False positives: it may report that the file contains `eval` if I/O streams the chunks as `"not_"`, `"eval"`, `"_at_all()"`. Fixing this means buffering the input.
- Buffering the input means starting over at the start of the buffer every time, even though `"not_"` would never need to be searched again.
- Buffering the input means bloating memory. We have seen some very impressively large files out there in the OSS ecosystem.
- Buffering the input means extra bookkeeping: string concatenation and, in particular, manually manipulating `eval_pattern.lastIndex`.
- Even if we find a match, some kinds of patterns like `\b` may change depending on what occurs after the match, like `/a\b/.test('a')` compared to `/a\b/.test('ab')`. So if the match is at the end of a chunk, the code has to keep buffering past it, and may falsely report a full match at a chunk boundary.
Additionally, code similar to the above can produce other kinds of bugs, such as false negatives, given different kinds of patterns:

```javascript
let safe_str = ''
for (const chunk of ['BAD_', 'WORD']) {
  safe_str += chunk.replaceAll(/BAD_WORD/g, '')
}
// safe_str includes BAD_WORD, oops!
```
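The buffering workaround described above can be sketched as follows; `makeBufferedScanner` is a hypothetical helper of our own, not part of any existing API, and it illustrates the costs rather than solving them:

```javascript
// A minimal sketch of the buffering workaround: accumulate chunks and
// rescan the whole buffer on every push.
function makeBufferedScanner(pattern) {
  let buffer = ''
  return {
    push(chunk) {
      buffer += chunk // memory grows with the input
      // Rescans from the start of the buffer every time.
      // Note: patterns like \b are still unreliable at the buffer's end.
      return pattern.test(buffer)
    },
  }
}

const scanner = makeBufferedScanner(/BAD_WORD/)
let found = false
for (const chunk of ['BAD_', 'WORD']) {
  found = scanner.push(chunk) || found
}
console.log(found) // true: buffering catches the match that per-chunk scanning missed
```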
That is why we are proposing a feature to TC39, and looking for a champion, to bring incremental progress to `RegExp` matching. You can look at our proposal and see that it accounts for a variety of scenarios. In particular: lookbehind, lookahead, and aggregation of quantifiers are important.
Our hope is that the goals listed in the proposal allow for less wasted idle time on I/O, less duplicated scanning of strings, and reduced memory pressure. If done right, it might even be possible to persist progress after the JS VM is spun down! This is a very exciting prospect that could greatly improve text processing in JS.