Let's make JS RegExps Streamy

Proposing a more usable RegExp for JS in light of async I/O and streaming.

Bradley Meck Farias

February 17, 2023


The JavaScript RegExp class has a problem. Well, it has a few, but we're just focusing on one specific problem in this post. At Socket, we are currently experiencing a significant issue with RegExp that we would love to help fix.

The Socket code analysis engine spends a lot of time scanning files, and we try to do it quickly so that our customers get timely results. Regular expressions can be fast and expressive, but there is a dark side to their design in JS. In particular, it is hard to know how far a match has progressed, or whether it has progressed at all.

Consider an overly naive check for eval in some file:

const eval_pattern = /\beval\b/g
// file contains: 'not_eval_at_all()'
for await (const chunk of file) {
  // scans each chunk in isolation, so chunk edges look like word boundaries
  eval_pattern.exec(chunk)
}

This snippet has a bug and several related problems.

  • False positives: it may report that the file contains eval if I/O delivers the chunks as "not_", "eval", "_at_all()". Fixing this means buffering the input.
  • Buffering the input means starting over at the beginning of the buffer on every chunk, even though "not_" never needs to be searched again.
  • Buffering the input means bloating memory. We have seen some impressively large files out there in the OSS ecosystem.
  • Buffering the input means extra bookkeeping: string concatenation, and manual manipulation of eval_pattern.lastIndex in particular.
  • Even when we find a match, assertions like \b can change meaning depending on what comes after the match: compare /a\b/.test('a') with /a\b/.test('ab'). So if a match ends exactly at a chunk boundary, the code has to keep buffering past it, or it may falsely report a complete match.

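Today, working around all of this means writing the buffering bookkeeping by hand. Below is a minimal sketch of that workaround for the eval check above; scanChunks is an invented helper name, and the tail size it keeps assumes this specific fixed-length pattern:

```javascript
// A hand-rolled workaround sketch: carry a buffer across chunks so word
// boundaries are never judged at a chunk edge. scanChunks is an
// illustrative name, not a real API.
function scanChunks(chunks) {
  const pattern = /\beval\b/g
  let buffer = ''
  let offset = 0   // absolute stream position of buffer[0]
  const matches = []
  for (const chunk of chunks) {
    buffer += chunk
    pattern.lastIndex = 0
    let m
    let safe = 0   // buffer index before which no future match can begin
    while ((m = pattern.exec(buffer)) !== null) {
      if (pattern.lastIndex === buffer.length) {
        // The trailing \b sits on the chunk boundary; the next chunk
        // could invalidate it ("eval" + "uate"), so defer this match.
        safe = m.index
        break
      }
      matches.push(offset + m.index)
      safe = pattern.lastIndex
    }
    if (m === null) {
      // No pending match: keep a conservative tail in case a match
      // straddles the boundary ("eval" is 4 chars, plus \b context).
      safe = Math.max(safe, buffer.length - 6)
    }
    buffer = buffer.slice(safe)
    offset += safe
  }
  // End of stream: a trailing boundary is now definitive.
  pattern.lastIndex = 0
  let m
  while ((m = pattern.exec(buffer)) !== null) matches.push(offset + m.index)
  return matches
}
```

Even this sketch only works because the pattern has a known maximum length; quantifiers like + or * have no such bound, which is part of why a language-level solution is attractive.
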
Additionally, code similar to the above can produce other kinds of bugs, such as false negatives, with different kinds of patterns:

let safe_str = ''
for (const chunk of ['BAD_', 'WORD']) {
  safe_str += chunk.replaceAll(/BAD_WORD/g, '')
}
// safe_str includes BAD_WORD, oops!
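
For a fixed literal pattern, the workaround is to carry a short tail between chunks so a split occurrence is still seen whole. Here is a minimal sketch, assuming the pattern is a plain string (replaceStream is an invented name); general regexes have no such fixed tail length, which is exactly the hard part:

```javascript
// Sketch: chunked replace that carries the longest suffix that could
// still be the start of `bad` into the next chunk. replaceStream is an
// illustrative name, not a real API; `bad` must be a literal string.
function replaceStream(chunks, bad, replacement) {
  let carry = ''
  let out = ''
  for (const chunk of chunks) {
    carry = (carry + chunk).replaceAll(bad, replacement)
    // Flush everything that can no longer participate in a match.
    const keep = Math.min(carry.length, bad.length - 1)
    out += carry.slice(0, carry.length - keep)
    carry = carry.slice(carry.length - keep)
  }
  return out + carry  // carry is shorter than bad, so it cannot match
}
```

The original loop fails because each chunk is scanned in isolation; here the kept tail ("BAD_WOR" at most) is re-scanned together with the next chunk, so nothing slips through the boundary.
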

That is why we are proposing a feature to TC39, and looking for a champion, to bring incremental progress to RegExp matching. You can look at our proposal and see that it accounts for a variety of scenarios. In particular: lookbehind, lookahead, and aggregation of quantifiers are important.
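
As a small illustration of why lookahead is tricky here: an assertion like (?=...) inspects text beyond the match itself, so a candidate match that ends exactly at a chunk boundary cannot be confirmed or rejected until more input arrives:

```javascript
// Lookahead depends on what follows the match, so the same "foo"
// succeeds or fails based on text a streaming matcher has not seen yet.
const p = /foo(?=bar)/
console.log(p.test('foo'))     // false: nothing after "foo" yet
console.log(p.test('foobar'))  // true: identical prefix, different suffix
```
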

Our hope is that the goals listed in the proposal will lead to less idle time wasted on I/O, less duplicated scanning of strings, and reduced memory pressure. If done right, it might even be possible to persist matching progress even if the JS VM is spun down! This could greatly improve text processing in JS.
