TagSoup is the fastest pure JS SAX/DOM XML/HTML parser.
npm install --save-prod tag-soup
⚠️ API documentation is available here.
import {createSaxParser} from 'tag-soup';
// Or use
// import {createXmlSaxParser, createHtmlSaxParser} from 'tag-soup';

const saxParser = createSaxParser({
  startTag(token) {
    console.log(token); // → {tokenType: 1, name: 'foo', …}
  },
  endTag(token) {
    console.log(token); // → {tokenType: 101, name: 'foo', …}
  },
});

saxParser.parse('<foo>okay');
The SAX parser invokes callbacks during parsing.
Callbacks receive tokens that represent structures read from the input. Tokens are pooled objects: when a handler callback returns, its token goes back to the pool and is reused. Object pooling drastically reduces memory consumption and allows passing a lot of data to the callback.
If you need to retain a token after the callback returns, use token.clone(), which returns a deep copy of the token.
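To make the pooling behaviour concrete, here is a minimal sketch in plain TypeScript. It is not TagSoup's implementation: `PooledToken`, `emitPooled` and the single-object pool are hypothetical names chosen for illustration. It shows why a token stored without `clone()` changes after the callback returns.

```typescript
// Hypothetical stand-in for a pooled token; not TagSoup's real token type.
class PooledToken {
  name = '';

  // Returns an independent copy that is safe to keep after the callback returns.
  clone(): PooledToken {
    const copy = new PooledToken();
    copy.name = this.name;
    return copy;
  }
}

// A "pool" with a single reusable token: every emit overwrites the same object.
const sharedToken = new PooledToken();

function emitPooled(tagName: string, callback: (token: PooledToken) => void): void {
  sharedToken.name = tagName;
  callback(sharedToken);
}

const retainedTokens: PooledToken[] = [];
const clonedTokens: PooledToken[] = [];

for (const tagName of ['foo', 'bar']) {
  emitPooled(tagName, (token) => {
    retainedTokens.push(token); // keeps a reference to the pooled object
    clonedTokens.push(token.clone()); // keeps an independent snapshot
  });
}

console.log(retainedTokens.map((token) => token.name)); // → ['bar', 'bar']
console.log(clonedTokens.map((token) => token.name)); // → ['foo', 'bar']
```

Both entries in `retainedTokens` point at the same object, so they always show the most recently emitted name.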
startTag and endTag callbacks are always invoked in the correct order, even if tags in the input were incorrectly nested or missing.
For self-closing tags, only the startTag callback is invoked.
All SAX parser factories accept two arguments: the handler with callbacks, and options. The most generic parser factory, createSaxParser, doesn't have any defaults.
For createXmlSaxParser, the defaults are xmlParserOptions.
For createHtmlSaxParser, the defaults are htmlParserOptions. With these defaults, p, li, td and other such tags follow implicit end rules, so <p>foo<p>bar is parsed as <p>foo</p><p>bar</p>.
You can alter how the parser works through options, which give you fine-grained control over the parsing dialect.
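The implicit end rule mentioned above can be sketched in a few lines of plain TypeScript. This is an illustration only, not TagSoup's code: the `implicitEnds` table, the `TagEvent` type and `serializeWithImplicitEnds` are hypothetical, and the rule table is far smaller than what an HTML parser needs.

```typescript
// Hypothetical rule table: opening the key tag implicitly ends the listed open tags.
const implicitEnds: Record<string, string[]> = {
  p: ['p'],
  li: ['li'],
  td: ['td', 'th'],
};

type TagEvent = {type: 'start'; name: string} | {type: 'text'; data: string};

function serializeWithImplicitEnds(events: TagEvent[]): string {
  const openTags: string[] = [];
  let output = '';

  for (const tagEvent of events) {
    if (tagEvent.type === 'text') {
      output += tagEvent.data;
      continue;
    }
    const closable = implicitEnds[tagEvent.name] || [];

    // Find the nearest open tag that this start tag implicitly ends.
    let closeIndex = -1;
    for (let i = openTags.length - 1; i >= 0; i--) {
      if (closable.indexOf(openTags[i]) !== -1) {
        closeIndex = i;
        break;
      }
    }
    // Close that tag together with everything nested inside it.
    while (closeIndex !== -1 && openTags.length > closeIndex) {
      output += '</' + openTags.pop() + '>';
    }
    openTags.push(tagEvent.name);
    output += '<' + tagEvent.name + '>';
  }

  // Close whatever is still open at the end of the input.
  while (openTags.length > 0) {
    output += '</' + openTags.pop() + '>';
  }
  return output;
}

console.log(
  serializeWithImplicitEnds([
    {type: 'start', name: 'p'},
    {type: 'text', data: 'foo'},
    {type: 'start', name: 'p'},
    {type: 'text', data: 'bar'},
  ]),
); // → '<p>foo</p><p>bar</p>'
```

Closing the nested tags along with the implicitly ended one is a design choice of this sketch. It keeps the output well-formed without re-opening formatting elements in the next paragraph.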
By default, TagSoup uses speedy-entities to decode XML and HTML entities. A parser created by createHtmlSaxParser decodes only legacy HTML entities. This is done to reduce the bundle size.
To decode all HTML entities, use the snippet below. It adds 10 kB gzipped to the bundle size.
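import {createHtmlSaxParser} from 'tag-soup';
import {decodeHtml} from 'speedy-entities/lib/full';

// Decoders are options, so they go in the second argument, after the handler.
const htmlParser = createHtmlSaxParser({/*callbacks*/}, {
  decodeText: decodeHtml,
  decodeAttribute: decodeHtml,
});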
With speedy-entities, you can create a custom decoder that recognizes custom entities.
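As a rough idea of what a custom decoder does, here is a minimal sketch in plain TypeScript. It does not use the speedy-entities API, which has its own way of building decoders. `createCustomDecoder` is a hypothetical helper, and the sketch assumes that a decoder is a function from string to string, as the `decodeText` and `decodeAttribute` usage suggests. It handles named entities only.

```typescript
// Hypothetical helper: builds a decoder for a fixed set of named entities.
function createCustomDecoder(entities: Record<string, string>): (input: string) => string {
  return (input) =>
    input.replace(/&([a-z][a-z0-9]*);/gi, (match: string, entityName: string) => {
      // Unknown entities are left untouched; the lookup is case-sensitive.
      return Object.prototype.hasOwnProperty.call(entities, entityName) ? entities[entityName] : match;
    });
}

const decodeCustom = createCustomDecoder({amp: '&', hello: 'Hello, world'});

console.log(decodeCustom('&hello; &amp; &unknown;')); // → 'Hello, world & &unknown;'
```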
The legacy HTML entities are: aacute, Aacute, acirc, Acirc, acute, aelig, AElig, agrave, Agrave, amp, AMP, aring, Aring, atilde, Atilde, auml, Auml, brvbar, ccedil, Ccedil, cedil, cent, copy, COPY, curren, deg, divide, eacute, Eacute, ecirc, Ecirc, egrave, Egrave, eth, ETH, euml, Euml, frac12, frac14, frac34, gt, GT, iacute, Iacute, icirc, Icirc, iexcl, igrave, Igrave, iquest, iuml, Iuml, laquo, lt, LT, macr, micro, middot, nbsp, not, ntilde, Ntilde, oacute, Oacute, ocirc, Ocirc, ograve, Ograve, ordf, ordm, oslash, Oslash, otilde, Otilde, ouml, Ouml, para, plusmn, pound, quot, QUOT, raquo, reg, REG, sect, shy, sup1, sup2, sup3, szlig, thorn, THORN, times, uacute, Uacute, ucirc, Ucirc, ugrave, Ugrave, uml, uuml, Uuml, yacute, Yacute, yen, yuml.
SAX parsers support streaming. You can use saxParser.write(chunk) to parse input data chunk by chunk.
const saxParser = createSaxParser({/*callbacks*/});
saxParser.write('<foo>ok');
// Triggers startTag callback for "foo" tag.
saxParser.write('ay');
// Doesn't trigger any callbacks.
saxParser.write('</foo>');
// Triggers text callback for "okay" and endTag callback for "foo" tag.
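The callback timing above follows from buffering: text cannot be reported until the parser knows where it ends. Below is a minimal sketch of that idea in plain TypeScript. It is not TagSoup's tokenizer. `createChunkedParser` and `ChunkHandler` are hypothetical, and the sketch ignores attributes, comments and every other piece of syntax.

```typescript
interface ChunkHandler {
  startTag(tagName: string): void;
  endTag(tagName: string): void;
  text(data: string): void;
}

// Hypothetical chunked parser that keeps unfinished input in a buffer.
function createChunkedParser(handler: ChunkHandler) {
  let buffer = '';

  return {
    write(chunk: string): void {
      buffer += chunk;

      while (buffer.length > 0) {
        if (buffer.charAt(0) === '<') {
          const tagEnd = buffer.indexOf('>');
          if (tagEnd === -1) {
            break; // the tag is incomplete, wait for the next chunk
          }
          const body = buffer.slice(1, tagEnd);
          if (body.charAt(0) === '/') {
            handler.endTag(body.slice(1));
          } else {
            handler.startTag(body);
          }
          buffer = buffer.slice(tagEnd + 1);
        } else {
          const textEnd = buffer.indexOf('<');
          if (textEnd === -1) {
            break; // the text may continue in the next chunk
          }
          handler.text(buffer.slice(0, textEnd));
          buffer = buffer.slice(textEnd);
        }
      }
    },
  };
}

const streamEvents: string[] = [];
const chunkedParser = createChunkedParser({
  startTag(tagName) { streamEvents.push('startTag ' + tagName); },
  endTag(tagName) { streamEvents.push('endTag ' + tagName); },
  text(data) { streamEvents.push('text ' + data); },
});

chunkedParser.write('<foo>ok'); // streamEvents: ['startTag foo']
chunkedParser.write('ay'); // nothing new, "okay" is still buffered
chunkedParser.write('</foo>'); // adds 'text okay' and 'endTag foo'
console.log(streamEvents); // → ['startTag foo', 'text okay', 'endTag foo']
```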
import {createDomParser} from 'tag-soup';
// Or use
// import {createXmlDomParser, createHtmlDomParser} from 'tag-soup';

// Minimal DOM handler example
const domParser = createDomParser<any>({
  element(token) {
    return {tagName: token.name, children: []};
  },
  // Text nodes need a factory too, otherwise "okay" is never added to the tree.
  text(token) {
    return {data: token.data};
  },
  appendChild(parentNode, node) {
    parentNode.children.push(node);
  },
});

const domNode = domParser.parse('<foo>okay');

console.log(domNode[0].children[0].data); // → 'okay'
The DOM parser assembles a node tree using a handler that describes how nodes are created and appended.
The generic parser factory createDomParser requires a handler to be provided.
Both createXmlDomParser and createHtmlDomParser use domHandler if no other handler is provided, and use default options (xmlParserOptions and htmlParserOptions respectively), which can be overridden.
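The role of the handler can be illustrated without the library. The sketch below, in plain TypeScript, assembles a tree from a list of parse events and delegates node creation and attachment to a handler. `buildTree`, `TreeHandler`, `TreeEvent` and `PlainNode` are hypothetical names, and the handler shape is simplified compared to what a real DOM parser needs.

```typescript
// Hypothetical handler: the tree builder never creates or links nodes itself.
interface TreeHandler<TNode> {
  element(tagName: string): TNode;
  text(data: string): TNode;
  appendChild(parentNode: TNode, node: TNode): void;
}

type TreeEvent =
  | {type: 'startTag'; tagName: string}
  | {type: 'endTag'}
  | {type: 'text'; data: string};

function buildTree<TNode>(events: TreeEvent[], handler: TreeHandler<TNode>): TNode[] {
  const roots: TNode[] = [];
  const ancestors: TNode[] = []; // stack of currently open elements

  const attach = (node: TNode): void => {
    if (ancestors.length === 0) {
      roots.push(node);
    } else {
      handler.appendChild(ancestors[ancestors.length - 1], node);
    }
  };

  for (const treeEvent of events) {
    if (treeEvent.type === 'startTag') {
      const node = handler.element(treeEvent.tagName);
      attach(node);
      ancestors.push(node);
    } else if (treeEvent.type === 'endTag') {
      ancestors.pop();
    } else {
      attach(handler.text(treeEvent.data));
    }
  }
  return roots;
}

interface PlainNode {
  tagName?: string;
  data?: string;
  children: PlainNode[];
}

const plainRoots = buildTree<PlainNode>(
  [{type: 'startTag', tagName: 'foo'}, {type: 'text', data: 'okay'}, {type: 'endTag'}],
  {
    element: (tagName) => ({tagName, children: []}),
    text: (data) => ({data, children: []}),
    appendChild: (parentNode, node) => {
      parentNode.children.push(node);
    },
  },
);

console.log(plainRoots[0].children[0].data); // → 'okay'
```

Because the builder only calls the handler, swapping in a different handler changes the shape of the resulting nodes without touching the traversal logic.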
DOM parsers support streaming. You can use domParser.write(chunk) to parse input data chunk by chunk.
const domParser = createXmlDomParser();
domParser.write('<foo>ok');
// → [{nodeType: 1, tagName: 'foo', children: [], …}]
domParser.write('ay');
// → [{nodeType: 1, tagName: 'foo', children: [], …}]
domParser.write('</foo>');
// → [{nodeType: 1, tagName: 'foo', children: [{nodeType: 3, data: 'okay', …}], …}]
To run a performance test, use npm ci && npm run build && npm run perf.
Performance was measured when parsing the 3.81 MB HTML file.
Results are in operations per second. The higher number is better.
SAX parser | Ops/sec |
---|---|
createSaxParser ¹ | 36.3 ± 0.8% |
createXmlSaxParser ¹ | 30.7 ± 0.5% |
createHtmlSaxParser ¹ | 23.7 ± 0.5% |
createSaxParser | 29.2 ± 0.5% |
createXmlSaxParser | 26.1 ± 0.5% |
createHtmlSaxParser | 19.9 ± 0.5% |
@fb55/htmlparser2 | 14.3 ± 0.5% |
@isaacs/sax-js | 1.7 ± 4.6% |
¹ Parsers were provided a handler with a single text callback. This configuration can be useful if you want to strip tags from the input.
DOM parser | Ops/sec |
---|---|
createDomParser | 13.7 ± 0.5% |
createXmlDomParser | 12.6 ± 0.5% |
createHtmlDomParser | 10.6 ± 0.5% |
@fb55/htmlparser2 | 8.4 ± 0.5% |
@inikulin/parse5 | 2.8 ± 0.7% |
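Because the figures above are operations per second over a file of known size, they can be converted to an approximate throughput. The sketch below assumes that one operation means parsing the whole 3.81 MB file once. `toThroughput` is a hypothetical helper and not part of the benchmark suite.

```typescript
// Hypothetical helper: converts ops/sec into MB/s for a file of the given size.
function toThroughput(opsPerSecond: number, fileSizeMb: number): string {
  return (opsPerSecond * fileSizeMb).toFixed(1) + ' MB/s';
}

console.log(toThroughput(36.3, 3.81)); // createSaxParser ¹ → '138.3 MB/s'
console.log(toThroughput(10.6, 3.81)); // createHtmlDomParser → '40.4 MB/s'
```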
Performance was measured when parsing 258 files, 95 kB in size on average, from htmlparser-benchmark.
Results are in operations per second. The higher number is better.
SAX parser | Ops/sec |
---|---|
createSaxParser | 1 998.0 ± 0.1% |
createXmlSaxParser | 1 734.1 ± 0.1% |
createHtmlSaxParser | 1 285.4 ± 0.1% |
@fb55/htmlparser2 | 717.5 ± 0.2% |
DOM parser | Ops/sec |
---|---|
createDomParser | 1 087.1 ± 0.2% |
createXmlDomParser | 853.5 ± 0.2% |
createHtmlDomParser | 668.0 ± 0.2% |
@fb55/htmlparser2 | 457.7 ± 0.2% |
@inikulin/parse5 | 50.8 ± 0.4% |
TagSoup doesn't resolve some weird element structures that malformed HTML may cause.
For example, assume the following markup:
<p><strong>okay
<p>nope
With DOMParser, this markup would be transformed to:
<p><strong>okay</strong></p>
<p><strong>nope</strong></p>
TagSoup doesn't insert the second strong tag:
<p><strong>okay</strong></p>
<p>nope</p> <!-- Note the absent "strong" tag -->