oniguruma-to-es

Convert Oniguruma patterns to native JavaScript regexes

  • 0.1.1

Oniguruma-To-ES

A lightweight Oniguruma to JavaScript RegExp transpiler that runs in the browser and on your server. Use it to:

  • Take advantage of Oniguruma's many extended regex features in JavaScript.
  • Run regexes intended for Oniguruma in JavaScript, such as those used in TextMate grammars (used by VS Code, Shiki syntax highlighter, etc.).
  • Share regexes across your Ruby and JavaScript code.

Compared to running the actual Oniguruma C library in JavaScript via WASM bindings (e.g. via vscode-oniguruma), this library is much lighter weight (the WASM binary alone is 460+ KB) and its regexes typically run much faster since they run as native JavaScript.

Try the demo REPL

Oniguruma-To-ES deeply understands all of the hundreds of large and small differences in Oniguruma and JavaScript regex syntax and behavior across multiple JavaScript version targets. It's obsessive about precisely following Oniguruma syntax rules and ensuring that the emulated features it supports have exactly the same behavior, even in extreme edge cases. And it's battle-tested on thousands of real-world Oniguruma regexes used in TextMate grammars (via the Shiki library). A few uncommon features can't be perfectly emulated and allow rare differences, but if you don't want to allow this, you can set the accuracy option to throw for such patterns (see details below).

[!NOTE] This library is currently in beta and has several known bugs. However, it's already quite robust and is ready for use. Please report any issues.

🕹️ Install and use

npm install oniguruma-to-es
import {toRegExp} from 'oniguruma-to-es';
const str = '…';
const pattern = '…';
// Works with all string/regexp methods since it returns a native JS regexp
str.match(toRegExp(pattern));
Using a global name (no import)
<script src="https://cdn.jsdelivr.net/npm/oniguruma-to-es/dist/index.min.js"></script>
<script>
  const {toRegExp} = OnigurumaToES;
</script>

🔑 API

toRegExp

Transpiles an Oniguruma pattern and returns a JavaScript RegExp.

[!TIP] Try it in the demo REPL.

function toRegExp(
  pattern: string,
  options?: Options
): RegExp | EmulatedRegExp;
Type Options
type Options = {
  accuracy?: 'strict' | 'default' | 'loose';
  avoidSubclass?: boolean;
  flags?: OnigurumaFlags;
  global?: boolean;
  hasIndices?: boolean;
  maxRecursionDepth?: number | null;
  target?: 'ES2018' | 'ES2024' | 'ESNext';
  tmGrammar?: boolean;
  verbose?: boolean;
};

See Options for more details.

toDetails

Transpiles an Oniguruma pattern to the parts needed to construct a JavaScript RegExp.

function toDetails(
  pattern: string,
  options?: Options
): {
  pattern: string;
  flags: string;
  strategy?: {
    name: string;
    subpattern?: string;
  };
};

The returned flags (as well as the pattern, of course) might be different than those provided, as a result of the emulation process. The returned pattern, flags, and strategy can be provided as arguments to the EmulatedRegExp constructor to produce the same result as toRegExp.

If the only keys returned are pattern and flags, they can optionally be provided to JavaScript's RegExp constructor instead. Setting option avoidSubclass to true ensures that this is always the case, and any patterns that rely on EmulatedRegExp's additional handling for emulation throw an error.
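
As a rough sketch of the rule above, the helper below rebuilds a regex from a toDetails-style result. It uses the native RegExp constructor when no strategy is present and otherwise defers to a constructor passed in by the caller. The helper name regexFromDetails and the hand-written details object are hypothetical and not part of the library. In real use, the second argument would be the library's EmulatedRegExp.

```javascript
// Hypothetical helper (not part of the library): rebuild a regex from an
// object shaped like the toDetails return value described above.
// `EmulatedRegExpCtor` is injected so this sketch needs no imports.
function regexFromDetails(details, EmulatedRegExpCtor) {
  const {pattern, flags, strategy} = details;
  if (strategy === undefined) {
    // Only `pattern` and `flags` were returned: a native RegExp is enough.
    return new RegExp(pattern, flags);
  }
  if (typeof EmulatedRegExpCtor !== 'function') {
    throw new TypeError('A strategy is present, so an EmulatedRegExp constructor is required');
  }
  return new EmulatedRegExpCtor(pattern, flags, strategy);
}

// Hand-written details object for demonstration (not real library output).
const rebuilt = regexFromDetails({pattern: 'a.b', flags: 'su'});
console.log(rebuilt.test('a\nb')); // true, since JS flag s lets the dot match \n
```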

toOnigurumaAst

Generates an Oniguruma AST from an Oniguruma pattern.

function toOnigurumaAst(
  pattern: string,
  options?: {
    flags?: OnigurumaFlags;
  }
): OnigurumaAst;

EmulatedRegExp

Works the same as the native JavaScript RegExp constructor in all contexts, but can be provided results from toDetails to produce the same result as toRegExp.

class EmulatedRegExp extends RegExp {
  constructor(
    pattern: string | EmulatedRegExp,
    flags?: string,
    strategy?: {
      name: string;
      subpattern?: string;
    }
  );
};

🔩 Options

The following options are shared by functions toRegExp and toDetails.

accuracy

One of 'strict', 'default' (default), or 'loose'.

Sets the level of emulation rigor/strictness.

  • Strict: Throw if the pattern can't be emulated with identical behavior (even in rare edge cases) for the given target.
  • Default: The best choice in most cases. Permits a few close approximations of Oniguruma in order to support additional features.
  • Loose: Useful for non-critical matching like syntax highlighting where having some mismatches is better than not working.

Each level of increased accuracy supports a subset of patterns supported by lower accuracies. If a given pattern doesn't produce an error for a particular accuracy, its generated result will be identical with all lower levels of accuracy (given the same target).
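
Because stricter levels accept a subset of what looser levels accept, one possible way to find out how faithfully a pattern can be emulated is to try the levels in order and keep the first that doesn't throw. The sketch below assumes a transpile function that behaves like toRegExp(pattern, options). The helper name and the stand-in transpiler are hypothetical and exist only so the example runs without the library.

```javascript
// Accuracy levels from most to least rigorous, as listed above.
const ACCURACY_LEVELS = ['strict', 'default', 'loose'];

// Hypothetical helper (not part of the library): returns the strictest
// accuracy level that `transpile` accepts for `pattern`, plus the result.
function transpileAtStrictestAccuracy(transpile, pattern, options = {}) {
  let lastError;
  for (const accuracy of ACCURACY_LEVELS) {
    try {
      return {accuracy, regex: transpile(pattern, {...options, accuracy})};
    } catch (err) {
      lastError = err; // remember the failure and try the next, looser level
    }
  }
  throw lastError;
}

// Stand-in transpiler for demonstration only. It rejects \X under 'strict',
// loosely mirroring the documented rule that \X needs 'default' or 'loose'.
function fakeTranspile(pattern, {accuracy}) {
  if (accuracy === 'strict' && pattern.includes('\\X')) {
    throw new Error('\\X is not supported with strict accuracy');
  }
  return new RegExp('(?:)', 'u'); // placeholder, not a real translation
}

console.log(transpileAtStrictestAccuracy(fakeTranspile, '\\X').accuracy); // "default"
console.log(transpileAtStrictestAccuracy(fakeTranspile, 'abc').accuracy); // "strict"
```

With the real library, the first argument would be toRegExp imported from 'oniguruma-to-es'.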

More details
strict

Supports slightly fewer features, but the missing features are all relatively uncommon (see below).

default

Supports all features of strict, plus the following additional features, depending on target:

  • All targets (ESNext and earlier):
    • Enables use of \X using a close approximation of a Unicode extended grapheme cluster.
    • Enables recursion (e.g. via \g<0>) with a depth limit specified by option maxRecursionDepth.
  • ES2024 and earlier:
    • Enables use of case-insensitive backreferences to case-sensitive groups.
  • ES2018:
    • Enables use of POSIX classes [:graph:] and [:print:] using ASCII-based versions rather than the Unicode versions available for ES2024 and later. Other POSIX classes are always based on Unicode.
loose

Supports all features of default, plus the following:

  • Silences errors for unsupported uses of the search-start anchor \G (a flexible assertion that doesn't have a direct equivalent in JavaScript).
    • Oniguruma-To-ES uses a variety of strategies to accurately emulate many common uses of \G. When using loose accuracy, if a \G assertion is found that doesn't have a known emulation strategy, the \G is simply removed and JavaScript's y (sticky) flag is added. This might lead to some false positives and negatives.
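
For readers unfamiliar with the sticky flag mentioned above, the following native JavaScript sketch shows what y does: a match must begin exactly at lastIndex, which is the closest built-in analogue to a search-start anchor. The helper name matchAt is hypothetical, and this is not a description of how the library emulates \G.

```javascript
// Hypothetical helper: try to match `regex` in `str` starting exactly at `pos`.
function matchAt(regex, str, pos) {
  // Copy the regex with the sticky flag added (if it isn't already set).
  const flags = regex.flags.includes('y') ? regex.flags : regex.flags + 'y';
  const sticky = new RegExp(regex.source, flags);
  sticky.lastIndex = pos;
  return sticky.exec(str); // null unless a match starts exactly at `pos`
}

console.log(matchAt(/\d+/, 'ab12cd', 0));    // null (no digits at index 0)
console.log(matchAt(/\d+/, 'ab12cd', 2)[0]); // "12"
```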

avoidSubclass

Default: false.

Disables advanced emulation strategies that rely on returning a RegExp subclass, resulting in certain patterns not being emulatable.

flags

Oniguruma flags; a string with i, m, and x in any order (all optional).

Flags can also be specified via modifiers in the pattern.

[!IMPORTANT] Oniguruma and JavaScript both have an m flag but with different meanings. Oniguruma's m is equivalent to JavaScript's s (dotAll).
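
To make the difference concrete, this native JavaScript snippet contrasts JS flag s (the behavior Oniguruma's m corresponds to) with JS flag m, which only changes how ^ and $ treat line breaks. It uses no library calls.

```javascript
const twoLines = 'a\nb';

// JS flag s (dotAll): the dot also matches newlines.
const dotWithS = /a.b/s.test(twoLines);
// JS flag m (multiline): the dot still excludes newlines.
const dotWithM = /a.b/m.test(twoLines);
// JS flag m affects anchors instead: ^ can match after a line break.
const caretWithM = /^b/m.test(twoLines);
const caretWithoutM = /^b/.test(twoLines);

console.log(dotWithS, dotWithM, caretWithM, caretWithoutM); // true false true false
```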

global

Default: false.

Include JavaScript flag g (global) in the result.

hasIndices

Default: false.

Include JavaScript flag d (hasIndices) in the result.
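
As a brief illustration of what flag d provides in native JavaScript, the sketch below reads start and end offsets for the whole match and for each named group. The helper name namedGroupSpans is hypothetical. It needs a runtime that supports flag d (for example Node.js 16 or later).

```javascript
// Hypothetical helper: return [start, end] offsets for the whole match and
// for each named group, using the native hasIndices (d) flag.
function namedGroupSpans(regex, str) {
  const flags = regex.flags.includes('d') ? regex.flags : regex.flags + 'd';
  const match = new RegExp(regex.source, flags).exec(str);
  if (!match) return null;
  return {whole: match.indices[0], groups: {...match.indices.groups}};
}

const spans = namedGroupSpans(/(?<year>\d{4})-(?<month>\d{2})/, 'on 2024-11-09');
console.log(spans.whole);        // [3, 10]
console.log(spans.groups.year);  // [3, 7]
console.log(spans.groups.month); // [8, 10]
```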

maxRecursionDepth

Default: 6.

Specifies the recursion depth limit. Supported values are integers 2 to 100 and null. If null, any use of recursion results in an error.

Since recursion isn't infinite-depth like in Oniguruma, use of recursion also results in an error if using strict accuracy.

More details

Using a high limit has a (usually tiny) impact on transpilation and regex performance. Generally, this is only a problem if the regex has an existing issue with runaway backtracking that recursion exacerbates.

Higher limits have no effect on regexes that don't use recursion, so you should feel free to increase this if helpful.
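
To give a feel for what a recursion depth limit means, here is a deliberately simplified sketch that emulates bounded recursion by textually expanding a pattern a fixed number of times. It is an illustration of the general idea only and is not the library's implementation. The '<R>' placeholder, the expandRecursion helper, and the way depth is counted here are all assumptions made for this example.

```javascript
// Illustrative only: expand a self-referencing pattern to a fixed depth.
// `template` marks the recursion point with the hypothetical placeholder '<R>'.
function expandRecursion(template, depth) {
  let expanded = '(?!)'; // never matches: stops recursion beyond the limit
  for (let i = 0; i < depth; i++) {
    const inner = expanded;
    // A replacer function avoids special handling of `$` in the replacement.
    expanded = template.replace(/<R>/g, () => `(?:${inner})`);
  }
  return expanded;
}

// Balanced parentheses: "(", then non-parens or a nested pair, then ")".
const parenTemplate = String.raw`\((?:[^()]|<R>)*\)`;
const depth3 = new RegExp(`^${expandRecursion(parenTemplate, 3)}$`);

console.log(depth3.test('((()))'));   // true: three levels fit within the limit
console.log(depth3.test('(((())))')); // false: a fourth level exceeds it
```

The generated source grows with each level, which is consistent with the note above that higher limits carry some cost for patterns that use recursion.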

target

One of 'ES2018', 'ES2024' (default), or 'ESNext'.

Sets the JavaScript language version for the generated pattern and flags. Later targets allow faster processing, simpler generated source, and support for additional features.

More details
  • ES2018: Uses JS flag u.
    • Emulation restrictions: Character class intersection, nested negated character classes, and Unicode properties added after ES2018 are not allowed.
    • Generated regexes might use ES2018 features that require Node.js 10 or a browser version released during 2018 to 2023 (in Safari's case). Minimum requirement for any regex is Node.js 6 or a 2016-era browser.
  • ES2024: Uses JS flag v.
    • No emulation restrictions.
    • Generated regexes require Node.js 20 or any 2023-era browser (compat table).
  • ESNext: Uses JS flag v and allows use of flag groups and duplicate group names.
    • Benefits: Faster transpilation, simpler generated source, and duplicate group names are preserved across separate alternation paths.
    • Generated regexes might use features that require Node.js 23 or a 2024-era browser (except Safari, which lacks support for flag groups).
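
If the runtime isn't known in advance, one possible approach is to feature-detect the syntax each target relies on and choose the target accordingly. The sketch below checks only the features named in the list above (flag v, flag groups, and duplicate group names) and does not verify the ES2018 features themselves. The helper names are hypothetical.

```javascript
// Returns true if the runtime can compile the given regex source and flags.
function supportsRegex(source, flags) {
  try {
    new RegExp(source, flags);
    return true;
  } catch {
    return false;
  }
}

// Hypothetical helper: pick a value for the `target` option based on which
// regex syntax the current runtime accepts.
function detectTarget() {
  if (!supportsRegex('', 'v')) return 'ES2018';
  const flagGroups = supportsRegex('(?i:a)', 'v');
  const duplicateNames = supportsRegex('(?<n>a)|(?<n>b)', 'v');
  return flagGroups && duplicateNames ? 'ESNext' : 'ES2024';
}

console.log(detectTarget()); // depends on the runtime
```

The result could then be passed as the target option, for example toRegExp(pattern, {target: detectTarget()}).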

tmGrammar

Default: false.

Leave disabled unless the regex will be used in a TextMate grammar processor that merges backreferences across begin and end patterns.

verbose

Default: false.

Disables optimizations that simplify the pattern when it doesn't change the meaning.

✅ Supported features

Following are the supported features by target. The official Oniguruma syntax doc doesn't cover many of the finer details described here.

[!NOTE] Targets ES2024 and ESNext have the same emulation capabilities. Resulting regexes might have different source and flags, but they match the same strings.

Notice that nearly every feature below has at least subtle differences from JavaScript. Some features and subfeatures listed as unsupported are not emulatable using native JavaScript regexes, but support for others might be added in future versions of this library. Unsupported features throw an error.

| Feature | | Example | ES2018 | ES2024+ | Subfeatures & JS differences |
| --- | --- | --- | --- | --- | --- |
| Flags | i | i | ✅ | ✅ | ✔ Unicode case folding (same as JS with flag u, v) |
| | m | m | ✅ | ✅ | ✔ Equivalent to JS flag s (dotAll) |
| | x | x | ✅ | ✅ | ✔ Unicode whitespace ignored |
| | | | | | ✔ Line comments with # |
| | | | | | ✔ Whitespace/comments allowed between a token and its quantifier |
| | | | | | ✔ Whitespace/comments between a quantifier and the ?/+ that makes it lazy/possessive changes it to a chained quantifier |
| | | | | | ✔ Whitespace/comments separate tokens (ex: \1 0) |
| | | | | | ✔ Whitespace and # not ignored in char classes |
| Flag modifiers | Group | (?im-x:…) | ✅ | ✅ | ✔ Unicode case folding for i |
| | | | | | ✔ Allows enabling and disabling the same flag (priority: disable) |
| | | | | | ✔ Allows lone or multiple - |
| | Directive | (?im-x) | ✅ | ✅ | ✔ Continues until end of pattern or group (spanning alternatives) |
| Characters | Literal | E, ! | ✅ | ✅ | ✔ Code point based matching (same as JS with flag u, v) |
| | | | | | ✔ Standalone ], {, } don't require escaping |
| | Identity escape | \E, \! | ✅ | ✅ | ✔ Different allowed set than JS |
| | | | | | ✔ Allows multibyte chars |
| | Escaped metachar | \\, \. | ✅ | ✅ | ✔ Same as JS |
| | Shorthand | \t | ✅ | ✅ | ✔ The JS set plus \a, \e |
| | \xNN | \x7F | ✅ | ✅ | ✔ Allows 1 hex digit |
| | | | | | ✔ Above 7F, is UTF-8 encoded byte (unlike JS) |
| | | | | | ✔ Error for invalid encoded bytes |
| | \uNNNN | \uFFFF | ✅ | ✅ | ✔ Same as JS with flag u, v |
| | \x{…} | \x{A} | ✅ | ✅ | ✔ Allows leading 0s up to 8 total hex digits |
| | Escaped num | \20 | ✅ | ✅ | ✔ Can be backref, error, null, octal, identity escape, or any of these combined with literal digits, based on complex rules that differ from JS |
| | | | | | ✔ Always handles escaped single digit 1-9 outside char class as backref |
| | | | | | ✔ Allows null with 1-3 0s |
| | | | | | ✔ Error for octal > 177 |
| | Control | \cA, \C-A | ✅ | ✅ | ✔ With A-Za-z (JS: only \c form) |
| | Other (extremely rare) | | ❌ | ❌ | Not yet supported: |
| | | | | | ● Non-A-Za-z with \cx, \C-x |
| | | | | | ● Meta \M-x, \M-\C-x |
| | | | | | ● Octal code point \o{…} |
| | | | | | ● UTF-8 encoded bytes in octal |
| Character sets | Digit, word | \d, \w, etc. | ✅ | ✅ | ✔ Same as JS (ASCII) |
| | Hex digit | \h, \H | ✅ | ✅ | ✔ ASCII |
| | Whitespace | \s, \S | ✅ | ✅ | ✔ ASCII (unlike JS) |
| | Dot | . | ✅ | ✅ | ✔ Excludes only \n (unlike JS) |
| | Any | \O | ✅ | ✅ | ✔ Any char (with any flags) |
| | | | | | ✔ Identity escape in char class |
| | Not newline | \N | ✅ | ✅ | ✔ Identity escape in char class |
| | Unicode property | \p{L}, \P{L} | ✅[1] | ✅ | ✔ Binary properties |
| | | | | | ✔ Categories |
| | | | | | ✔ Scripts |
| | | | | | ✔ Aliases |
| | | | | | ✔ POSIX properties |
| | | | | | ✔ Invert with \p{^…}, \P{^…} |
| | | | | | ✔ Insignificant spaces, underscores, and casing in names |
| | | | | | ✔ \p, \P without { is an identity escape |
| | | | | | ✔ Error for key prefixes |
| | | | | | ✔ Error for props of strings |
| | | | | | ❌ Blocks (wontfix[2]) |
| Variable-length sets | Newline | \R | ✅ | ✅ | ✔ Matched atomically |
| | Grapheme | \X | ☑️ | ☑️ | ● Uses a close approximation |
| | | | | | ✔ Matched atomically |
| Character classes | Base | […], [^…] | ✅ | ✅ | ✔ Unescaped - outside of range is literal in some contexts (different than JS rules in any mode) |
| | | | | | ✔ Fewer chars require escaping than JS |
| | | | | | ✔ Error for reversed range (same as JS) |
| | Empty | [], [^] | ✅ | ✅ | ✔ Error |
| | Range | [a-z] | ✅ | ✅ | ✔ Same as JS with flag u, v |
| | POSIX class | [[:word:]], [[:^word:]] | ☑️[3] | ✅ | ✔ All use Unicode definitions |
| | Nested class | […[…]] | ☑️[4] | ✅ | ✔ Same as JS with flag v |
| | Intersection | […&&…] | ❌ | ✅ | ✔ Doesn't require nested classes for intersection of union and ranges |
| Assertions | Line start, end | ^, $ | ✅ | ✅ | ✔ Always "multiline" |
| | | | | | ✔ Only \n as newline |
| | String start, end | \A, \z | ✅ | ✅ | ✔ Same as JS ^ $ without JS flag m |
| | String end or before terminating newline | \Z | ✅ | ✅ | ✔ Only \n as newline |
| | Search start | \G | ☑️ | ☑️ | ● Common uses supported |
| | Word boundary | \b, \B | ✅ | ✅ | ✔ Unicode based (unlike JS) |
| | Lookaround | (?=…), (?!…), (?<=…), (?<!…) | ✅ | ✅ | ✔ Same as JS |
| | | | | | ✔ Allows variable-length quantifiers and alternation within lookbehind |
| Quantifiers | Greedy, lazy | *, +?, {2,}, etc. | ✅ | ✅ | ✔ Includes all JS forms |
| | | | | | ✔ Adds {,n} for min 0 |
| | | | | | ✔ Explicit bounds have upper limit of 100,000 (unlimited in JS) |
| | | | | | ✔ Error with assertions (same as JS with flag u, v) |
| | Possessive | ?+, *+, ++ | ✅ | ✅ | ✔ + suffix doesn't make interval ({…}) quantifiers possessive (creates a chained quantifier) |
| | Chained | **, ??+*, {2,3}+, etc. | ✅ | ✅ | ✔ Further repeats the preceding repetition |
| Groups | Noncapturing | (?:…) | ✅ | ✅ | ✔ Same as JS |
| | Atomic | (?>…) | ✅ | ✅ | ✔ Supported |
| | Capturing | (…) | ✅ | ✅ | ✔ Is noncapturing if named capture present |
| | Named capturing | (?<a>…), (?'a'…) | ✅ | ✅ | ✔ Duplicate names allowed (including within the same alternation path) unless directly referenced by a subroutine |
| | | | | | ✔ Error for names invalid in Oniguruma or JS |
| Backreferences | Numbered | \1 | ✅ | ✅ | ✔ Error if named capture used |
| | | | | | ✔ Refs the most recent of a capture/subroutine set |
| | Enclosed numbered, relative | \k<1>, \k'1', \k<-1>, \k'-1' | ✅ | ✅ | ✔ Error if named capture used |
| | | | | | ✔ Allows leading 0s |
| | | | | | ✔ Refs the most recent of a capture/subroutine set |
| | | | | | ✔ \k without < ' is an identity escape |
| | Named | \k<a>, \k'a' | ✅ | ✅ | ✔ For duplicate group names, rematch any of their matches (multiplex) |
| | | | | | ✔ Refs the most recent of a capture/subroutine set (no multiplex) |
| | | | | | ✔ Combination of multiplex and most recent of capture/subroutine set if duplicate name is indirectly created by a subroutine |
| | To nonparticipating groups | | ☑️ | ☑️ | ✔ Error if group to the right[5] |
| | | | | | ✔ Duplicate names (and subroutines) to the right not included in multiplex |
| | | | | | ✔ Fail to match (or don't include in multiplex) ancestor groups and groups in preceding alternation paths |
| | | | | | ❌ Some rare cases are indeterminable at compile time and use the JS behavior of matching an empty string |
| Subroutines | Numbered, relative | \g<1>, \g'1', \g<-1>, \g'-1', \g<+1>, \g'+1' | ✅ | ✅ | ✔ Allowed before reffed group |
| | | | | | ✔ Can be nested (any depth) |
| | | | | | ✔ Doesn't alter backref nums |
| | | | | | ✔ Reuses flags from the reffed group (ignores local flags) |
| | | | | | ✔ Replaces most recent captured values (for backrefs) |
| | | | | | ✔ \g without < ' is an identity escape |
| | | | | | ✔ Error if named capture used |
| | Named | \g<a>, \g'a' | ✅ | ✅ | ● Same behavior as numbered |
| | | | | | ✔ Error if reffed group uses duplicate name |
| Recursion | Full pattern | \g<0>, \g'0' | ☑️ | ☑️ | ● Has depth limit[6] |
| | Named, numbered, relative | (?<a>…\g<a>?…), (…\g<1>?…), (…\g<-1>?…), etc. | ☑️ | ☑️ | ● Has depth limit[6] |
| Other | Comment group | (?#…) | ✅ | ✅ | ✔ Allows escaping \), \\ |
| | | | | | ✔ Comments allowed between a token and its quantifier |
| | | | | | ✔ Comments between a quantifier and the ?/+ that makes it lazy/possessive changes it to a chained quantifier |
| | Alternation | …\|… | ✅ | ✅ | ✔ Same as JS |
| | Keep | \K | ☑️ | ☑️ | ● Supported if at top level and no top-level alternation is used |
| | Absence operator | (?~…) | ❌ | ❌ | ● Some forms are supportable |
| | Conditional | (?(1)…) | ❌ | ❌ | ● Some forms are supportable |
| | Char sequence | \x{1 2 …N}, \o{1 2 …N} | ❌ | ❌ | ● Not yet supported |
| JS features unknown to Oniguruma are handled using Oniguruma syntax | | | ✅ | ✅ | ✔ \u{…} is an error |
| | | | | | ✔ [\q{…}] matches q, etc. |
| | | | | | ✔ [a--b] includes the invalid reversed range a to - |
| Invalid Oniguruma syntax | | | ✅ | ✅ | ✔ Error |

The table above doesn't include all aspects that Oniguruma-To-ES emulates (including error handling, most aspects that work the same as in JavaScript, and many aspects of non-JavaScript features that work the same in the other regex flavors that support them).
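
A few of the "unlike JS" entries in the table are easy to observe from the JavaScript side. The snippet below records native JavaScript behavior only (no library calls). According to the table, Oniguruma would give the opposite answer in each case, which is why a pattern can't simply be copied across unchanged.

```javascript
const jsBehavior = {
  // JS \s is Unicode-aware and matches a no-break space; the table lists
  // Oniguruma's \s as ASCII.
  nbspIsWhitespace: /\s/.test('\u00A0'),
  // JS dot (without flag s) excludes \r; the table lists Oniguruma's dot as
  // excluding only \n.
  dotMatchesCarriageReturn: /./.test('\r'),
  // JS \b is ASCII-based, so it sees a boundary between "é" and "b"; the
  // table lists Oniguruma's \b as Unicode based.
  boundaryAfterAccentedLetter: /\bb/.test('\u00E9b'),
};

console.log(jsBehavior);
// { nbspIsWhitespace: true, dotMatchesCarriageReturn: false, boundaryAfterAccentedLetter: true }
```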

Footnotes

  1. Target ES2018 doesn't allow using Unicode property names added in JavaScript specifications after ES2018.
  2. Unicode blocks (which in Oniguruma are used with an In… prefix) are easily emulatable but their character data would significantly increase library weight. They're also a flawed and arguably-unuseful feature, given the ability to use Unicode scripts and other properties.
  3. With target ES2018, the specific POSIX classes [:graph:] and [:print:] use ASCII-based versions rather than the Unicode versions available for target ES2024 and later, and they result in an error if using strict accuracy.
  4. Target ES2018 doesn't support nested negated character classes.
  5. It's not an error for numbered backreferences to come before their referenced group in Oniguruma, but an error is the best path for Oniguruma-To-ES because (1) most placements are mistakes and can never match (based on the Oniguruma behavior for backreferences to nonparticipating groups), (2) erroring matches the behavior of named backreferences, and (3) the edge cases where they're matchable rely on rules for backreference resetting within quantified groups that are different in JavaScript and aren't emulatable. Note that it's not a backreference in the first place if using \10 or higher and not as many capturing groups are defined to the left (it's an octal or identity escape).
  6. The recursion depth limit is specified by option maxRecursionDepth. Some forms of recursion (multiple recursions in the same pattern, and recursion with backreferences) aren't yet supported. Patterns that would error in Oniguruma due to triggering infinite recursion might find a match in Oniguruma-To-ES since recursion is bounded (future versions will detect this and error at transpilation time).

㊗️ Unicode / mixed case-sensitivity

Oniguruma-To-ES fully supports mixed case-sensitivity (and handles the Unicode edge cases) regardless of JavaScript target. It also restricts Unicode properties to those supported by Oniguruma and the target JavaScript version.

Oniguruma-To-ES focuses on being lightweight to make it better for use in browsers. This is partly achieved by not including heavyweight Unicode character data, which imposes a couple of minor/rare restrictions:

  • Character class intersection and nested negated character classes are unsupported with target ES2018. Use target ES2024 or later if you need support for these features.
  • With targets before ESNext, a handful of Unicode properties that target a specific character case (ex: \p{Lower}) can't be used case-insensitively in patterns that contain other characters with a specific case that are used case-sensitively.
    • In other words, almost every usage is fine, including A\p{Lower}, (?i:A\p{Lower}), (?i:A)\p{Lower}, (?i:A(?-i:\p{Lower})), and \w(?i:\p{Lower}), but not A(?i:\p{Lower}).
    • Using these properties case-insensitively is basically never done intentionally, so you're unlikely to encounter this error unless it's catching a mistake.

👀 Similar projects

JsRegex transpiles Onigmo regexes to JavaScript (Onigmo is a fork of Oniguruma with mostly shared syntax and behavior). It's written in Ruby and relies on the Regexp::Parser Ruby gem, which means regexes must be pre-transpiled on the server to use them in JavaScript. Note that JsRegex doesn't always translate edge case behavior differences.

🏷️ About

Oniguruma-To-ES was created by Steven Levithan.

If you want to support this project, I'd love your help by contributing improvements, sharing it with others, or sponsoring ongoing development.

© 2024–present. MIT License.

Package last updated on 09 Nov 2024
