
oniguruma-parser
A TypeScript library for parsing, validating, traversing, transforming, and optimizing Oniguruma regular expressions.
[!NOTE] Oniguruma is a regular expression engine written in C that's used in Ruby (via a fork named Onigmo), PHP (`mb_ereg`, etc.), TextMate grammars (used by VS Code, Shiki, etc.), and many other tools.
This library has been battle-tested by Oniguruma-To-ES and tm-grammars, which are used by Shiki to process tens of thousands of real-world Oniguruma regexes.
npm install oniguruma-parser
import {toOnigurumaAst} from 'oniguruma-parser';
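The following modules are available in addition to the root `'oniguruma-parser'` export:

- Parser module: functions and types for `OnigurumaAst` nodes. Also includes the `parse` function, wrapped by `toOnigurumaAst`.
- Traverser module: traverse and transform an `OnigurumaAst`.
- Generator module: convert an `OnigurumaAst` to pattern and flags strings.
- Optimizer module: minify and improve the performance of Oniguruma regexes.

To parse an Oniguruma pattern (with optional flags and compile-time options) and return an AST, call `toOnigurumaAst`, which uses the following type definition: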
function toOnigurumaAst(
  pattern: string,
  options?: {
    flags?: string;
    rules?: {
      captureGroup?: boolean;
      singleline?: boolean;
    };
  }
): OnigurumaAst;
For example:
import {toOnigurumaAst} from 'oniguruma-parser';

const ast = toOnigurumaAst('A.*');
console.log(ast);
/* →
{ type: 'Regex',
  body: [
    { type: 'Alternative',
      body: [
        { type: 'Character',
          value: 65,
        },
        { type: 'Quantifier',
          kind: 'greedy',
          min: 0,
          max: Infinity,
          body: {
            type: 'CharacterSet',
            kind: 'dot',
          },
        },
      ],
    },
  ],
  flags: {
    type: 'Flags',
    ignoreCase: false,
    dotAll: false,
    extended: false,
    digitIsAscii: false,
    posixIsAscii: false,
    spaceIsAscii: false,
    wordIsAscii: false,
    textSegmentMode: null,
  },
}
*/
An error is thrown if the provided pattern or flags aren't valid in Oniguruma.
Note: `toOnigurumaAst` is a wrapper around the parser module's `parse` function that makes it easier to use by automatically providing the appropriate Unicode property validation data.
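The AST in the example above is a plain object tree, so it can be inspected with ordinary recursion. The following sketch hand-copies the `A.*` result as a literal (rather than calling the library) and lists its node types in depth-first order. The `SketchNode` type and `collectTypes` helper are hypothetical and cover only the fields shown in that example; for real ASTs, the traverser module is the supported way to walk the tree.

```typescript
// Hypothetical, minimal node shape: just enough for the example output above.
type SketchNode = {
  type: string;
  body?: SketchNode | SketchNode[];
  flags?: SketchNode;
  [key: string]: unknown;
};

// Hand-copied from the `toOnigurumaAst('A.*')` output (flags abbreviated).
const ast: SketchNode = {
  type: 'Regex',
  body: [
    {
      type: 'Alternative',
      body: [
        {type: 'Character', value: 65},
        {
          type: 'Quantifier',
          kind: 'greedy',
          min: 0,
          max: Infinity,
          body: {type: 'CharacterSet', kind: 'dot'},
        },
      ],
    },
  ],
  flags: {type: 'Flags', ignoreCase: false, dotAll: false, extended: false},
};

// Depth-first walk: visit the node, then `body` (one node or an array), then `flags`.
function collectTypes(node: SketchNode, out: string[] = []): string[] {
  out.push(node.type);
  const children = Array.isArray(node.body) ? node.body : node.body ? [node.body] : [];
  for (const child of children) {
    collectTypes(child, out);
  }
  if (node.flags) {
    collectTypes(node.flags, out);
  }
  return out;
}

const nodeTypes = collectTypes(ast);
console.log(nodeTypes.join(' > '));
// Regex > Alternative > Character > Quantifier > CharacterSet > Flags
```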
See details and examples in the traverser module's readme.
See details and examples in the generator module's readme.
This library includes one of the few implementations (for any regex flavor) of a "regex optimizer" that can minify and improve the performance and readability of regexes prior to use.
Example:
(?x) (?:\!{1,}) (\b(?:ark|arm|art)\b) [[^0-9A-Fa-f]\P{^Nd}\p{ Letter }]
Becomes:
!+\b(ar[kmt])\b[\H\d\p{L}]
Optimized regexes always match exactly the same strings.
See more details and examples in the optimizer module's readme.
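One way to build confidence that a rewrite preserves matches is to compare the original and optimized forms against sample inputs. The sketch below does this for two of the rewrites in the example above, using JavaScript's built-in `RegExp` as a stand-in, since these particular fragments mean the same thing in both flavors. It does not call the optimizer; the `sameMatches` helper and the sample strings are illustrative assumptions.

```typescript
// Returns true if both regexes agree, for every sample, on whether it matches.
function sameMatches(a: RegExp, b: RegExp, inputs: string[]): boolean {
  return inputs.every((input) => a.test(input) === b.test(input));
}

// Illustrative inputs; not taken from the library's tests.
const samples: string[] = ['', '!', '!!!', 'ark', 'arm', 'art', 'arx', 'ar', 'smart', 'art!'];

// `(?:\!{1,})` versus `!+` (anchored so the whole sample must match).
const bangsAgree = sameMatches(/^(?:\!{1,})$/, /^!+$/, samples);

// `\b(?:ark|arm|art)\b` versus `\bar[kmt]\b`.
const wordsAgree = sameMatches(/\b(?:ark|arm|art)\b/, /\bar[kmt]\b/, samples);

// A rewrite that changes behavior is caught: `!*` also matches the empty string.
const brokenAgree = sameMatches(/^!+$/, /^!*$/, samples);

console.log({bangsAgree, wordsAgree, brokenAgree});
```

Sampling like this can only find counterexamples, not prove equivalence, so it complements rather than replaces the optimizer's own guarantees.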
[!TIP] 🧪 Try the optimizer demo.
Known differences will be resolved in future versions.
The following rarely-used features throw errors since they aren't yet supported:

- `\cx` `\C-x`, meta `\M-x` `\M-\C-x`, octal code points `\o{…}`, and octal encoded bytes ≥ `\200`.
- Code point sequences: `\x{H H …}` `\o{O O …}`.
- Absence expressions `(?~|…|…)`, stoppers `(?~|…)`, and clearers `(?~|)`.
- Conditionals: `(?(…)…)`, etc.
- `(?{…})`, etc.
- Forward backreferences (including relative `\k<+N>`) and backreferences with recursion level (`\k<N+N>`, etc.).
- Flags `D` `P` `S` `W` `y{g}` `y{w}` within pattern modifiers, and whole-pattern modifiers `C` `I` `L`.

Despite these gaps, more than 99.99% of real-world Oniguruma regexes are supported, based on a sample of ~55k regexes used in TextMate grammars (conditionals were used in three regexes, and other unsupported features weren't used at all). Some of the Oniguruma features above are so exotic that they aren't used in any public code on GitHub.
This library currently treats it as an error if a numbered backreference comes before its referenced group. This is a rare issue because:

- Among unenclosed backreferences, it only affects `\1`–`\9`, since it's not a backreference in the first place if using `\10` or higher and not as many capturing groups are defined to the left (it's an octal or identity escape).

The following don't yet throw errors, but should:

- […] `-` or `+` (which is separate from how these symbols are used in relative numbered backreferences).

Although any number of digits is supported for enclosed `\k<…>`/`\k'…'` backreferences (assuming the backreference refers to a valid capturing group), unenclosed backreferences currently support only up to three digits (`\999`). In other words, `\1000` is handled as `\100` followed by `0` even if 1,000+ captures appear to the left.
Note: An apparent bug in vscode-oniguruma (v2.0.1 tested) prevents any regex with more than 999 captures from working. They fail to match anything, with no error.
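The three-digit cap on unenclosed backreferences can be illustrated with a small sketch. The `splitUnenclosedBackref` helper below is hypothetical (it is not part of this library) and models only the cap itself: it ignores how many capturing groups are defined, which in real parsing also decides whether `\10` and higher are backreferences at all.

```typescript
// Hypothetical helper: splits the digits that follow a backslash into the
// backreference number (at most three digits) and any trailing literal digits.
function splitUnenclosedBackref(digits: string): {ref: number; literal: string} {
  if (!/^[1-9][0-9]*$/.test(digits)) {
    throw new Error(`Expected digits starting with 1-9, got "${digits}"`);
  }
  // Only the first three digits form the backreference; the rest are literals.
  return {ref: Number(digits.slice(0, 3)), literal: digits.slice(3)};
}

const overCap = splitUnenclosedBackref('1000'); // backreference 100, then literal '0'
const atCap = splitUnenclosedBackref('999'); // backreference 999, nothing left over
console.log(overCap, atCap);
```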
This library intentionally doesn't reproduce bugs, and it currently throws errors for several edge cases that trigger Oniguruma bugs and undefined behavior.
Although nested absence functions like `(?~(?~…))` don't throw an error in Oniguruma, they produce self-described "strange" results, and Oniguruma's docs state that "nested absent functions are not supported and the behavior is undefined".

In this library, nested absence functions throw an error. In future versions, parsing of nested absence functions will follow Oniguruma and no longer error.
`\x` as a `NUL` character

In Oniguruma, `\x` is an escape for the `NUL` character (equivalent to `\0`, `\x00`, etc.) if it's not followed by `{` or a hexadecimal digit.

In this library, bare `\x` throws an error.

Additional behavior details for `\x` in Oniguruma:

- `\x` is an error if followed by a `{` that's followed by a hexadecimal digit but doesn't form a valid `\x{…}` code point escape. Ex: `\x{F` and `\x{0,2}` are errors.
- `\x` matches a literal `x` if followed by a `{` that isn't followed by a hexadecimal digit. Ex: `\x{` matches `x{`, `\x{G` matches `x{G`, and `\x{,2}` matches 0–2 `x` characters, since `{,2}` is a quantifier with an implicit 0 min.
- `\x` matches a literal `x` if it appears at the very end of a pattern. This is a bug.

In future versions, parsing of `\x` will follow the Oniguruma rules above (excluding bugs), removing some cases where it currently errors.
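The Oniguruma rules for `\x` described above can be summarized as a small decision function. This is a sketch of those rules only, not of this library's parser (which currently throws for bare `\x`). The `classifyX` name and its result labels are hypothetical, and as a simplification it treats `\x{…}` as valid only when the braces hold a single run of hexadecimal digits.

```typescript
type XResult = 'hex-byte' | 'code-point' | 'error' | 'literal-x' | 'nul';

// Hypothetical classifier: given the pattern text that follows `\x`,
// report how Oniguruma treats the escape according to the rules above.
function classifyX(rest: string): XResult {
  const isHex = (ch: string | undefined): boolean =>
    ch !== undefined && /^[0-9A-Fa-f]$/.test(ch);

  if (rest === '') return 'literal-x'; // end of pattern (described above as a bug)
  if (isHex(rest[0])) return 'hex-byte'; // ordinary \xH or \xHH
  if (rest[0] === '{') {
    if (!isHex(rest[1])) return 'literal-x'; // e.g. \x{ or \x{G or \x{,2}
    // Simplification: accept only one run of hex digits inside the braces.
    return /^\{[0-9A-Fa-f]+\}/.test(rest) ? 'code-point' : 'error';
  }
  return 'nul'; // not followed by `{` or a hex digit
}

for (const rest of ['', '41', '{41}', '{F', '{0,2}', '{', '{G', '{,2}', 'z']) {
  console.log(JSON.stringify('\\x' + rest), '=>', classifyX(rest));
}
```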
`\u`

Normally, any incomplete `\uHHHH` (including bare `\u`) throws an error. However, in Oniguruma 6.9.10 and earlier (report), bare `\u` matches a literal `u` if it appears at the very end of a pattern. This is a bug.

In this library, incomplete `\u` is always an error.
`\x80` to `\xFF`

Context: Unlike `\uHHHH` and enclosed `\x{H…}` (which match code points), Oniguruma's unenclosed `\xHH` represents an encoded byte, which means that, unlike in other regex flavors, `\x80` to `\xFF` are treated as fragments of a code unit. Ex: `[\0-\xE2\x82\xAC]` is equivalent to `[\0-\u20AC]`.
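To see why `\xE2\x82\xAC` corresponds to `\u20AC`, it helps to decode the three bytes as UTF-8 by hand. The sketch below assumes UTF-8 as the encoding and uses a hypothetical `decodeUtf8` helper written for this illustration; it is not part of this library or of Oniguruma. It checks lead and continuation bytes but, as a simplification, not overlong three- or four-byte forms or surrogates.

```typescript
// Decodes one UTF-8 sequence from the start of `bytes` into a code point.
// Throws if the first byte can't start a sequence or the sequence is malformed.
function decodeUtf8(bytes: number[]): number {
  const lead = bytes[0];
  if (lead === undefined) throw new Error('No bytes to decode');
  if (lead <= 0x7f) return lead; // single-byte (ASCII) sequence

  // 0x80-0xBF are continuation bytes, 0xC0-0xC1 would be overlong, and
  // 0xF5-0xFF never appear in UTF-8, so none of them can start a sequence.
  let length = 0;
  if (lead >= 0xc2 && lead <= 0xdf) length = 2;
  else if (lead >= 0xe0 && lead <= 0xef) length = 3;
  else if (lead >= 0xf0 && lead <= 0xf4) length = 4;
  if (length === 0) throw new Error(`Invalid lead byte 0x${lead.toString(16)}`);

  // Keep the payload bits of the lead byte, then append 6 bits per continuation byte.
  let codePoint = lead & (0xff >> (length + 1));
  for (let i = 1; i < length; i++) {
    const b = bytes[i];
    if (b === undefined || (b & 0xc0) !== 0x80) {
      throw new Error('Missing or invalid continuation byte');
    }
    codePoint = (codePoint << 6) | (b & 0x3f);
  }
  return codePoint;
}

const euro = decodeUtf8([0xe2, 0x82, 0xac]);
console.log('U+' + euro.toString(16).toUpperCase()); // U+20AC
```

A lone byte such as `0x80` or `0xFF`, or a lead byte like `0xE2` with nothing after it, makes `decodeUtf8` throw, which mirrors the idea that a byte in this range is only a fragment until the rest of its sequence is present.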
Invalid standalone encoded bytes should throw an error, but several related bugs are present in Oniguruma 6.9.10 and earlier (report).
In this library, they always throw an error.
Behavior details in Oniguruma:

- `\x80` to `\xF4` throw an error.
- `\xF5` to `\xFF` fail to match anything, but don't throw. This is a bug.
- […] `\x80` to `\xBF` and `\xF5` to `\xFF` are treated as `\x7F`. This is a bug.
- […] (ex: `[^\0-\xFF]`), `\xF5` to `\xFF` are treated as `\x{10FFFF}`. This is a bug.

All versions of this library to date have followed the rules of Oniguruma 6.9.10 (released 2025-01-01), which uses Unicode 16.0.0.
At least since Oniguruma 6.0.0 (released 2016-05-09), regex syntax changes in new versions have been backward compatible. Some versions added new syntax that was previously an error (such as new Unicode property names), and in a few cases, edge case parsing bugs were fixed.
Oniguruma 6.9.8 (released 2022-04-29) is an important baseline for JavaScript projects, since that's the version used by vscode-oniguruma 1.7.0 to the latest 2.0.1. It's therefore used in recent versions of various projects, including VS Code and Shiki. However, the regex syntax differences between Oniguruma 6.9.8 and 6.9.10 are so minor that this is a non-issue.
Contributions are welcome. See the guide to help you get started.
Created by Steven Levithan and contributors.
If you want to support this project, I'd love your help by contributing improvements (guide), sharing it with others, or sponsoring ongoing development.
MIT License.