UnionReplacer
UnionReplacer provides one-pass global search and replace functionality
using multiple regular expressions and corresponding replacements.
Otherwise the behavior matches String.prototype.replace(regexp, newSubstr|function).
Installation and usage
In browsers:
<script src="https://unpkg.com/union-replacer/dist/union-replacer.umd.js"></script>
Using npm:
npm install union-replacer
In Node.js:
const UnionReplacer = require('union-replacer');
With TypeScript (use one of the following forms, depending on the compiler's module settings):
import UnionReplacer from 'union-replacer';
import * as UnionReplacer from 'union-replacer';
import UnionReplacer = require('union-replacer');
Synopsis
replacer = new UnionReplacer(replace_pairs, [flags])
newStr = replacer.replace(str)
Parameters
- replace_pairs: array of [regexp, replacement] arrays, where
  - regexp: particular regexp element in the unioned regexp. Any flags it carries are ignored.
  - replacement: corresponds with String.prototype.replace (newSubstr or function).
- flags: regular expression flags to be set on the main underlying regexp; defaults to gm.
API updates
- v2.0 removes the addReplacement() method, see #4 for details.
- v2.0 introduces TypeScript type definitions along with precise JSDoc type definitions.
Examples
Convenient one-pass escaping of HTML special chars
const htmlEscapes = [
[/</, '&lt;'],
[/>/, '&gt;'],
[/"/, '&quot;'],
[/&/, '&amp;']
];
const htmlEscaper = new UnionReplacer(htmlEscapes);
const toBeHtmlEscaped = '<script>alert("inject & control")</script>';
console.log(htmlEscaper.replace(toBeHtmlEscaped));
Output:
&lt;script&gt;alert(&quot;inject &amp; control&quot;)&lt;/script&gt;
Simple Markdown highlighter
Highlighting Markdown special characters while preserving code blocks and spans.
Only a subset of Markdown syntax is supported for simplicity.
const mdHighlighter = new UnionReplacer([
[/^(`{3,}).*\n([\s\S]*?)(^\1`*\s*?$|(?![\s\S]))/, (match, fence1, pre, fence2) => {
let block = `<b>${fence1}</b><br />\n`
block += `<pre>${htmlEscaper.replace(pre)}</pre><br />\n`
block += `<b>${fence2}</b>`
return block;
}],
[/(^|[^`])(`+)(?!`)(.*?[^`]\2)(?!`)/, (match, lead, delim, code) => {
return `${htmlEscaper.replace(lead)}<code>${htmlEscaper.replace(code)}</code>`
}],
[/[*~=+_-`]+/, '<b>$&</b>'],
[/\n/, '<br />\n']
].concat(htmlEscapes));
const toBeMarkdownHighlighted = '\
**Markdown** code to be "highlighted"\n\
with special care to fenced code blocks:\n\
````\n\
_Markdown_ within fenced code blocks is not *processed*:\n\
```\n\
Even embedded "fence strings" work well with **UnionEscaper**\n\
```\n\
````\n\
*CommonMark is sweet & cool.*';
console.log(mdHighlighter.replace(toBeMarkdownHighlighted));
Produces:
<b>**</b>Markdown<b>**</b> code to be &quot;highlighted&quot;<br />
with special care to fenced code blocks:<br />
<b>````</b><br />
<pre>_Markdown_ within fenced code blocks is not *processed*:
```
Even embedded &quot;fence strings&quot; work well with **UnionEscaper**
```
</pre><br />
<b>````</b><br />
<b>*</b>CommonMark is sweet &amp; cool.<b>*</b>
Conservative markdown escaping
The code below escapes text so that special Markdown sequences are
protected from being interpreted. Two considerations apply:
- Avoid cluttering the output with unnecessary escapes.
- GFM autolinks are a special case, as escaping the special chars in them
would cripple the rendered result. We need to detect them and keep
them untouched.
const mdEscaper = new UnionReplacer([
[/\bhttps?:\/\/(?!\.)(?:\.?[\w-]+)+(?:[^\s<]*?)(?=[?!.,:*~]*(?:\s|$))/, '$&'],
[/[\\*_[\]`&<>]/, '\\$&'],
[/^(?:~~~|=+)/, '\\$&'],
[/~+/, m => m.length == 2 ? `\\${m}` : m],
[/^(?:[-+]|#{1,6})(?=\s)/, '\\$&'],
[/^(\d+)\.(?=\s)/, '$1\\.']
]);
const toBeMarkdownEscaped = '\
A five-*starred* escaper:\n\
1. Would preserve _underscored_ in the http://example.com/_underscored_/ URL.\n\
2. Would also preserve backslashes (\\) in http://example.com/\\_underscored\\_/.';
console.log(mdEscaper.replace(toBeMarkdownEscaped));
Produces:
A five-\*starred\* escaper:
1\. Would preserve \_underscored\_ in the http://example.com/_underscored_/ URL.
2\. Would also preserve backslashes (\\) in http://example.com/\_underscored\_/.
Background
The library was created to support complex text processing in situations
where a degree of configurability is desired.
The initial need occurred when using the Turndown
project. It is an excellent and flexible tool, but we faced several hard-to-solve
difficulties with escaping special sequences.
Without UnionReplacer
When text processing with several patterns is required, there are two approaches:
- Iterative processing of the full text, such as
return unsafe
.replace(/&/g, '&amp;')
.replace(/</g, '&lt;')
.replace(/>/g, '&gt;')
.replace(/"/g, '&quot;')
The issue is not only performance. Since subsequent replacements are
performed on a partially processed result, the developer has to ensure that
no intermediate step affects the later ones. E.g.:
return 'a "tricky" task'
.replace(/"/g, '&quot;')
.replace(/&/g, '&amp;')
So 'a "tricky" task' became 'a &amp;quot;tricky&amp;quot; task'. This
particular task is manageable by carefully choosing the processing order.
But when the processing is context-dependent, iterative processing becomes
impossible.
- One-pass processing using a regexp with alternations, which is correct, but
can easily become overly complex, hard to read and hard to manage. As
one can see, the result is rather static and very fragile in terms of
keeping track of all the individual capture groups:
const mdHighlightRe = /(^(`{3,}).*\n([\s\S]*?)(^\2`*\s*?$|(?![\s\S])))|((^|[^`])(`+)(?!`)(.*?[^`]\7)(?!`))|([*~=+_-`]+)|(\n)|(<)|(>)|(")|(&)/gm
return md.replace(mdHighlightRe,
(match, fenced, fence1, pre, fence2, codespan, lead, delim, code, special, nl, lt, gt, quot, amp) => {
if (fenced) {
let block = `<b>${fence1}</b><br />\n`
block += `<pre>${htmlEscaper.replace(pre)}</pre><br />\n`
block += `<b>${fence2}</b>`
return block;
} else if (codespan) {
return `${htmlEscaper.replace(lead)}<code>${htmlEscaper.replace(code)}</code>`
} else if (special) {
return `<b>${special}</b>`
} else if (nl) {
return '<br />\n'
} else {
// lt, gt, quot or amp matched: escape the single character
return htmlEscaper.replace(match)
}
});
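The ordering pitfall of the iterative approach can be reproduced with native JavaScript alone. The following is a minimal sketch and not part of the library; the names `entities`, `escapeIteratively` and `escapeOnePass` are illustrative assumptions.

```javascript
// Illustrative only: none of these names belong to UnionReplacer.
const entities = { '&': '&amp;', '<': '&lt;', '>': '&gt;', '"': '&quot;' };

// Iterative processing in an unlucky order: the second pass re-escapes
// the ampersands that the first pass has just produced.
function escapeIteratively(text) {
  return text
    .replace(/"/g, '&quot;')
    .replace(/&/g, '&amp;');
}

// One-pass processing: every input character is visited once, so the
// output of a replacement is never scanned again.
function escapeOnePass(text) {
  return text.replace(/[&<>"]/g, (ch) => entities[ch]);
}

console.log(escapeIteratively('a "tricky" task'));
// a &amp;quot;tricky&amp;quot; task
console.log(escapeOnePass('a "tricky" task'));
// a &quot;tricky&quot; task
```

A character class is enough here because every rule matches a single character; once rules need context, the alternation grows into something like the long regexp above.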
Introducing UnionReplacer
Iterative processing is simple and readable, but very limited.
Developers often end up trading simplicity for bugs.
While a regexp with alternations is the way to go, we wanted to provide an easy
way to build it, use it and even compose it variably at runtime.
Instead of using a single long regexp, developers can use an array
of individual smaller regexps, which are merged together by the
UnionReplacer class. Its usage is as simple as the iterative processing
approach.
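To make the idea of merging smaller regexps more concrete, here is a deliberately simplified sketch of the technique. It is not the library's implementation: `buildUnionReplace` is a hypothetical helper, it assumes the individual patterns contain no capture groups of their own, it inserts string replacements literally, and it passes only the matched text to function replacements.

```javascript
// Hypothetical helper for illustration; UnionReplacer itself does more
// (backreference renumbering, full replacement semantics).
function buildUnionReplace(pairs, flags = 'gm') {
  // Wrap every pattern in its own capture group and join with "|".
  // Assumption: the patterns contain no capture groups themselves.
  const source = pairs.map(([re]) => `(${re.source})`).join('|');
  const union = new RegExp(source, flags);
  return (text) => text.replace(union, (match, ...rest) => {
    // The first pairs.length arguments are the wrapper groups; the one
    // that is defined tells which rule matched.
    const index = rest.slice(0, pairs.length).findIndex((g) => g !== undefined);
    const replacement = pairs[index][1];
    return typeof replacement === 'function' ? replacement(match) : replacement;
  });
}

const escapeHtml = buildUnionReplace([
  [/&/, '&amp;'],
  [/</, '&lt;'],
  [/>/, '&gt;'],
  [/"/, '&quot;'],
]);
console.log(escapeHtml('<a href="x">Q&A</a>'));
// &lt;a href=&quot;x&quot;&gt;Q&amp;A&lt;/a&gt;
```

Because the rules live in a plain array, they can be concatenated, filtered or reordered before the compound regexp is built.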
Features
- Fast. The processing is one-pass and native regexps are used. There might
be a tiny resource penalty when initially constructing the internal
compound regexp.
- Supports regexp backreferences. Backreferences in the compound regexp are
renumbered, so that the user does not have to care about it.
- Also supports ES2018 named capture groups. See limitations.
- You can reuse everything used with String.prototype.replace(), namely:
- String replacements work the very same.
- Function replacements work the same with just a subtle difference for
ES2018 named capture groups.
- Standard regexp alternation semantics. The first rule that matches
consumes the match from the input, no matter how long the match is. An example
follows.
Alternation semantics
const replacer1 = new UnionReplacer([
[/foo/, '(FOO)'],
[/.+/, '(nonfoo)']
]);
const replacer2 = new UnionReplacer([
[/foo/, '(FOO)'],
[/.+?(?=foo|$)/, '(nonfoo)']
]);
const text = 'foobarfoobaz';
replacer1.replace(text); // '(FOO)(nonfoo)'
replacer2.replace(text); // '(FOO)(nonfoo)(FOO)(nonfoo)'
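The backreference renumbering listed among the features can be pictured with a small native sketch. This is an assumption about how such renumbering might be done, not the library's code; `countGroups`, `shiftBackrefs` and `unionSource` are hypothetical names, and the sketch ignores character classes, named groups and octal escapes.

```javascript
// Illustrative sketch only, not taken from UnionReplacer.

// Count capture groups by matching the empty string against "source|":
// the empty alternative always matches, and the result array has one
// slot per capture group plus one for the whole match.
function countGroups(source) {
  return new RegExp(`${source}|`).exec('').length - 1;
}

// Shift every numeric backreference (\1, \2, ...) by `offset`.
// Other escapes, including an escaped backslash, are left untouched.
function shiftBackrefs(source, offset) {
  return source.replace(/\\(?:([1-9]\d*)|[\s\S])/g, (esc, num) =>
    (num ? `\\${Number(num) + offset}` : esc));
}

function unionSource(regexps) {
  let offset = 0;
  return regexps.map((re) => {
    // +1 accounts for the wrapper group added around each pattern.
    const shifted = shiftBackrefs(re.source, offset + 1);
    offset += 1 + countGroups(re.source);
    return `(${shifted})`;
  }).join('|');
}

console.log(unionSource([/(a)\1/, /(b)(c)\2\1/]));
// ((a)\2)|((b)(c)\5\4)
```

Each rule keeps referring to its own groups even though their absolute numbers change once the rules share one compound regexp.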
Performance
Most importantly, the code was written with performance in mind.
At runtime, UnionReplacer performs one-pass processing driven by
a single native regexp.
The replacements are always done as an arrow function internally, even for
string replacements. Any performance impact of this would be
engine-dependent.
Feel free to benchmark the library and please share the results.
Limitations
Named capture groups
ES2018 named capture groups work with the following limitations:
- Replacement functions are always provided with all the named captures, i.e. not limited to the matched rule.
- Capture group names must be unique amongst all capture rules.
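The first limitation mirrors how native regexps behave once several named groups share one alternation, which is presumably what happens inside the compound regexp. The sketch below uses only native JavaScript; the pattern and the `seen` array are made up for illustration.

```javascript
// Native behaviour only; no library involved.
const union = /(?<year>\d{4})|(?<word>[a-z]+)/g;
const seen = [];

'in 2018'.replace(union, (...args) => {
  // When the pattern has named groups, the last callback argument
  // is the groups object.
  const groups = args[args.length - 1];
  seen.push({ keys: Object.keys(groups), year: groups.year, word: groups.word });
  return args[0];
});

console.log(seen);
// Both matches report the keys "year" and "word"; the group that did
// not take part in the match holds undefined.
```

A replacement function therefore needs to pick the names belonging to its own rule and expect `undefined` for all others.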
Octal escapes
Not supported. The syntax is the same as backreferences (\1) and
their interpretation is input-dependent even in native regexps.
It is better to avoid them completely and use hex escapes instead (\xNN).
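The ambiguity can be observed in native regexps without the `u` flag: the same `\1` is read either as an octal escape or as a backreference, depending on whether a capture group with that number exists. A short illustration using only native JavaScript:

```javascript
// No capture group exists, so \1 is read as the legacy octal escape
// for the character U+0001.
console.log(/\1/.test('\u0001')); // true

// With a capture group present, the same \1 is a backreference.
console.log(/(a)\1/.test('aa')); // true
console.log(/(a)\1/.test('a\u0001')); // false

// A hex escape keeps its meaning regardless of surrounding groups.
console.log(/(a)\x01/.test('a\u0001')); // true
```

Since merging rules changes the number of capture groups, a sequence that was an octal escape in a standalone regexp could turn into a backreference in the compound one, which is one more reason to prefer `\xNN`.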
Regexp flags
Any flags on particular search regexps are ignored.
The resulting replacement always uses the flags from the constructor call,
which default to global (g) and multiline (m).
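Why per-regexp flags get lost can be illustrated natively, under the assumption that a compound regexp is assembled from each pattern's `source`. The variable names below are illustrative.

```javascript
// Native illustration; assumes the compound regexp is built from .source.
const caseInsensitive = /foo/i;

// .source carries the pattern text only; the "i" flag is not part of it.
const compound = new RegExp(`(${caseInsensitive.source})`, 'gm');

console.log(caseInsensitive.test('FOO')); // true
console.log(compound.test('FOO')); // false
console.log(compound.flags); // gm
```

To get case-insensitive matching for all rules, the flag would have to be passed to the constructor, e.g. `'gmi'` as the `flags` argument.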