Research
Security News
Quasar RAT Disguised as an npm Package for Detecting Vulnerabilities in Ethereum Smart Contracts
Socket researchers uncover a malicious npm package posing as a tool for detecting vulnerabilities in Etherium smart contracts.
@bonniernews/atlas-html-stream
Advanced tools
A super fast html-parser stream that outputs tag, text and closing nodes.
Fork of atlas-html-stream which is a super fast html-parser stream that outputs tag,
text and closing nodes. See CHANGELOG.md
for changes introduced in the fork.
npm install --save @bonniernews/atlas-html-stream
I didn't like htmlparser2's streaming API and I wanted an html parser that collpased text-node whitespace by default. I also wanted to see if I could write a faster parser.
The following benchmark was done on my local machine using htmlparser-benchmark. Lower is better:
atlas-html-stream: 4.91208 ms/file ± 6.08846
htmlparser2: 7.06549 ms/file ± 5.09643
I only tested against htmlparser2 because it was the fastest stream-based parser. This benchmark uses the streaming interface of both parsers.
A simple parser requires the body to be in memory, which can be game-breaking if your HTML is very large. This parser is a stream, and benefits from using chunked data. The basic idea is to keep track of the current chunk and anything from the previous chunk which spills over into the current chunk. For example, consider the following two chunks:
"first data chunk <di"
"v>second chunk</div>"
When the parser is done with the first chunk, it will recognize that it needs to hold onto "<di"
and will only flush "first chunk "
from memory. When the parser is done with the second chunk, it will flush everything from memory since there's no longer a pending node.
Using this parser is really easy since it extends
the stream interface -- all you need to do is pipe or write HTML to it and listen to "data"
events:
The stream interface is recommended, because it allows you to consume a constant amount of memory (regardless of HTML size):
const { createReadStream } = require("fs");
const HtmlParser = require("@bonniernews/atlas-html-stream");
const myHtmlFile = createReadStream("./index.html");
myHtmlFile.pipe(new HtmlParser()).on("data", ({name, data, text}) => {
if (text){
// is a text node
console.log(text);
} else if (name && data){
// is an opening tag with potential attributes
console.log(name, data)
} else {
// is a closing tag
console.log(name)
}
}).on("end", () => {
console.log("parsed all html")
})
This is not recommended unless you can afford to have your HTML exist entirely in memory. First, we'll write a small helper to abstract away the streaming interface:
// helpers.js
const HtmlParser = require("@bonniernews/atlas-html-stream");
const parser = new HtmlParser();
const parseHtml = html => {
const nodes = [];
parser.on("data", data => nodes.push(data))
parser.write(html)
// resets the parser back to vanilla state
parser.reset();
return nodes;
}
module.exports = { parseHtml };
Next, we'll use our helper to parse our in-memory html file:
const { readFileSync } = require("fs");
const { parseHtml } = require("./helpers");
const file = readFileSync("./index.html");
const nodes = parseHtml(file);
// do what you want with the parsed nodes.
The examples above show you how minimal the API is, as it should be. Comment, script and style nodes are treated as regular nodes, in that they emit opening and closing tags. The main difference is that the content inside these nodes is treated as a single raw text node with no whitespace-collapse. For simplicity, let's use our parseHtml
helper function from above in the following examples:
...
const scriptNodes = parseHtml(`
<script src="./script.js">
const myVar = 5;
</script>
`);
scriptNodes.forEach(n => console.log(n))
// { name: 'script', data: { src: './script.js' } }
// { text: '\n const myVar = 5;\n ' }
// { name: 'script' }
Style tags are treated the same way, except their name
property has a value of "style"
.
Again, these are treated the same way.
const commentNodes = parseHtml(`
<!--
this is
a comment
-->
`)
commentNodes.forEach(n => console.log(n));
// { name: '!--', data: {} }
// { text: ' \n this is \n a comment \n ' }
// { name: '!--' }
By default, the parser will collapse all non-script/style/comment whitespace down to a single space, which is useful if you're scraping and don't care about whitespace. In some cases, you need the whitespace. You can pass a preserveWS
option to the constructor, which will force the parser to keep all whitespace it finds in text nodes:
const nodes = parseHtml(`
This is some text
<b>
Hola
</b>
`)
nodes.forEach(n => console.log(n));
// { text: "\n This is some text\n\n " }
// { name: "b", data: {} }
// { text: " \n Hola\n " }
// { name: "b" }
// { text: " \n" }
You can then post-process the nodes if you need more specificity. Note that this example doesn't use a streaming interface for simplicity, however, the streaming interface is highly recommended:
myFile.pipe(new HtmlParser({preserveWS: true})).pipe(myTransform).on("data", node => {...})
myTransform
can conditionally check text nodes and post-process them one-by-one.
parseHtml
helperShould parseHtml
be exported from this package in addition to HtmlParser
? Using parseHtml
is not recommended over the streaming interface, but it seems like a valid helper for cases where the html string needs to be in memory.
I'd like to make this thing even faster. The parsing itself takes about 3.5 ms/file
(using htmlparser-benchmark) on my machine. Pushing nodes as data
events to our stream adds around 40% more processing time, which is why the benchmark above shows around 4.9 ms/file
-- this can't be avoided, because we want the streaming interface.
The SeqMatcher
slows down this parser (checking comment, script and style nodes); there might be a faster way to handle these special nodes.
Switching on the state
of the parser first is probably faster than switching on the current character first, but I haven't tested the latter. The main idea is to minimize the amount of instructions required for each [state, char]
pair, according to how likely each pair is. If [TEXT, !whitespace]
has the highest probability in "typical" html, then we would want this pair to require the least amount of instructions. In other words, sum_i(probabilityOfPair_i*numSteps_i)
should be minimized.
I think collapsing the states SCRIPT
, STYLE
and COMMENT
into a single state SPECIAL
might improve the performance. Instead of polluting the space with SeqMatchers
for each one initially, we can create them on-demand for whichever SPECIAL
node we are parsing, and store them in a persistent hash in case we see the same SPECIAL
node again.
What defines a "typical" html document? A set of parsing instructions for a "typical" html document may be slower for "atypical" html documents which have a vastly different probabilityOfPair_i
distribution. What if the parser could use the distribution of pairs of the current document to dynamically change its condition-tree on-the-fly? For example, if we're getting overwhelmed by raw text, it would be faster to check if the state is TEXT
first. Alternatively, if our document has almost no raw text, it would be smarter to check if the state is TEXT
last. While the dynamic parser sounds interesting, it may not be worth implementing if it adds a ton of overhead.
Since the parser keeps pending nodes' data in memory, a large text node will cause memory issues. In the vast majority of cases, you will not run into text nodes bigger than the chunks themselves. If this becomes a problem, the parser could be modified to accept a maxTextSize
parameter which it could use to chunk the individual text nodes themselves. For now, this is an extreme edge case and isn't handled.
The doctype node is treated as any other node. The input "<!DOCTYPE html>"
would result in the following data event: {name: "!DOCTYPE", data: {html: ""}}
Tag attributes with no value are not treated in a special way; a key with no value will have an empty string value in the node's data field. I may implement true
as the default value, but this seems like something that can easily be done by the caller.
FAQs
A super fast html-parser stream that outputs tag, text and closing nodes.
The npm package @bonniernews/atlas-html-stream receives a total of 205 weekly downloads. As such, @bonniernews/atlas-html-stream popularity was classified as not popular.
We found that @bonniernews/atlas-html-stream demonstrated a not healthy version release cadence and project activity because the last version was released a year ago. It has 9 open source maintainers collaborating on the project.
Did you know?
Socket for GitHub automatically highlights issues in each pull request and monitors the health of all your open source dependencies. Discover the contents of your packages and block harmful activity before you install or update your dependencies.
Research
Security News
Socket researchers uncover a malicious npm package posing as a tool for detecting vulnerabilities in Etherium smart contracts.
Security News
Research
A supply chain attack on Rspack's npm packages injected cryptomining malware, potentially impacting thousands of developers.
Research
Security News
Socket researchers discovered a malware campaign on npm delivering the Skuld infostealer via typosquatted packages, exposing sensitive data.