simple-html-tokenizer
Simple HTML Tokenizer is a lightweight JavaScript library that can be used to tokenize the kind of HTML normally found in templates.
The simple-html-tokenizer npm package is a lightweight library designed to tokenize HTML strings. It breaks down HTML content into a stream of tokens, which can be useful for parsing, analyzing, or transforming HTML documents.
Tokenizing HTML
The `tokenize` function takes an HTML string as input and returns an array of tokens, one for each part of the HTML (start tags, end tags, and text).
const { tokenize } = require('simple-html-tokenizer');
const html = '<div>Hello, <span>world!</span></div>';
const tokens = tokenize(html);
console.log(tokens);
Handling different token types
The tokenizer produces tokens of several types, such as 'StartTag', 'EndTag', and 'Chars'. This code sample switches on `token.type` to process each type separately.
const { tokenize } = require('simple-html-tokenizer');
const html = '<div>Hello, <span>world!</span></div>';
const tokens = tokenize(html);
tokens.forEach(token => {
  switch (token.type) {
    case 'StartTag':
      console.log('Start tag:', token.tagName);
      break;
    case 'EndTag':
      console.log('End tag:', token.tagName);
      break;
    case 'Chars':
      console.log('Text:', token.chars);
      break;
    default:
      console.log('Other token:', token);
  }
});
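As a further illustration of consuming the token stream, here is a minimal sketch that turns tokens into an indented outline. The helper name `buildOutline` is hypothetical and not part of the library, and the tokens are written by hand in the shape used by the switch statement above, so the sketch runs without installing the package. In real use the array would come from `tokenize`.

```javascript
// Hand-written tokens shaped like the ones handled in the switch statement above.
// In real use they would come from tokenize(); they are inlined here as an assumption.
const sampleTokens = [
  { type: 'StartTag', tagName: 'div' },
  { type: 'Chars', chars: 'Hello, ' },
  { type: 'StartTag', tagName: 'span' },
  { type: 'Chars', chars: 'world!' },
  { type: 'EndTag', tagName: 'span' },
  { type: 'EndTag', tagName: 'div' },
];

// buildOutline is a hypothetical helper, not part of simple-html-tokenizer.
function buildOutline(tokens) {
  const lines = [];
  let depth = 0;
  for (const token of tokens) {
    if (token.type === 'StartTag') {
      lines.push('  '.repeat(depth) + '<' + token.tagName + '>');
      // Self-closing tags do not open a new nesting level.
      if (!token.selfClosing) depth += 1;
    } else if (token.type === 'EndTag') {
      // Never let unbalanced end tags push the depth below zero.
      depth = Math.max(0, depth - 1);
      lines.push('  '.repeat(depth) + '</' + token.tagName + '>');
    } else if (token.type === 'Chars') {
      lines.push('  '.repeat(depth) + JSON.stringify(token.chars));
    }
  }
  return lines;
}

console.log(buildOutline(sampleTokens).join('\n'));
```

The sketch assumes every start tag that is not self-closing has a matching end tag. Void elements such as `<br>` written without a closing slash would need extra handling.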
htmlparser2 is a fast and forgiving HTML/XML parser. It is more feature-rich than simple-html-tokenizer, offering a complete DOM structure and event-based parsing, which makes it suitable for more complex parsing tasks.
parse5 is a highly compliant HTML parser that produces a DOM tree. It is designed to be fully compatible with the HTML5 specification, making it more robust than simple-html-tokenizer for handling modern web content.
html-tokenize is another library for tokenizing HTML. It is similar in functionality to simple-html-tokenizer but offers a different API and may have different performance characteristics.
Simple HTML Tokenizer is a lightweight JavaScript library that can be used to tokenize the kind of HTML normally found in templates. It can be used to preprocess templates to change the behavior of some template element depending upon whether the template element was found in an attribute or text.
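To make the attribute-versus-text distinction concrete, here is a minimal sketch of how a preprocessor might classify template placeholders by where they occur. Everything in it is an assumption for illustration: the `{{name}}` placeholder syntax, the helper name `findPlaceholders`, and the hand-written tokens, which follow the token shape shown in this README's usage example (`type`, `tagName`, `attributes` as name/value pairs, `chars`). The library itself does not provide this function.

```javascript
// findPlaceholders is a hypothetical helper; the {{name}} syntax is an assumption.
function findPlaceholders(tokens) {
  const found = [];
  const collect = (text, context) => {
    if (typeof text !== 'string') return;
    // A fresh regex per call avoids sharing lastIndex state between strings.
    const pattern = /\{\{\s*([\w.]+)\s*\}\}/g;
    let match;
    while ((match = pattern.exec(text)) !== null) {
      found.push({ name: match[1], context });
    }
  };
  for (const token of tokens) {
    if (token.type === 'Chars') {
      collect(token.chars, 'text');
    } else if (token.type === 'StartTag') {
      // Each attribute is read as [name, value]; only the value is scanned.
      for (const [, value] of token.attributes || []) {
        collect(value, 'attribute');
      }
    }
  }
  return found;
}

// Hand-written tokens standing in for a tokenized '<a href="{{url}}">{{title}}</a>'.
const templateTokens = [
  { type: 'StartTag', tagName: 'a', attributes: [['href', '{{url}}']], selfClosing: false },
  { type: 'Chars', chars: '{{title}}' },
  { type: 'EndTag', tagName: 'a' },
];

console.log(findPlaceholders(templateTokens));
```

A preprocessor could then treat the two contexts differently, for example by escaping a value one way inside an attribute and another way inside text.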
It is not a full HTML5 tokenizer. It focuses on the kind of HTML that is used in templates: content designed to be inserted into the <body> and without <script> tags.
In particular, Simple HTML Tokenizer does not handle many states from the HTML5 Tokenizer Specification:
CDATA or RCDATA
<script>
<DOCTYPE>
It also passes through character references, instead of trying to tokenize and process them, because the preprocessed templates will ultimately be parsed by a real browser context.
At the moment, there are some error states specified by the tokenizer spec that are not handled by Simple HTML Tokenizer. Ultimately, I plan to support all error states, as well as provide information about tokenizer errors in debug mode.
You can tokenize HTML:
var tokens = HTML5Tokenizer.tokenize("<div id='foo' href=bar class=\"bat\">");
var token = tokens[0];
token.tagName //=> "div"
token.attributes //=> [["id", "foo"], ["href", "bar"], ["class", "bat"]]
token.selfClosing //=> false
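Since `token.attributes` is a list of pairs, looking up one attribute by name means scanning the list. A minimal sketch of a convenience helper follows. The name `attributesToObject` is hypothetical and not part of the library, and the token is written by hand to match the example above. The helper reads only the first two entries of each attribute, so it would still work if a given version of the library appended further entries.

```javascript
// Hypothetical helper: turn a StartTag token's attribute pairs into a plain object.
function attributesToObject(token) {
  const result = {};
  for (const attribute of token.attributes || []) {
    // Read only the first two entries (name, value); ignore anything extra.
    const [name, value] = attribute;
    // If a name repeats, the later value overwrites the earlier one.
    result[name] = value;
  }
  return result;
}

// Token written by hand to match the example above (an assumption, not library output).
const divToken = {
  type: 'StartTag',
  tagName: 'div',
  attributes: [['id', 'foo'], ['href', 'bar'], ['class', 'bat']],
  selfClosing: false,
};

const attrs = attributesToObject(divToken);
console.log(attrs.id, attrs.href, attrs.class);
```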
npm install
npm test
FAQs
The npm package simple-html-tokenizer receives a total of 379,662 weekly downloads. As such, simple-html-tokenizer is classified as popular.
We found that simple-html-tokenizer has an unhealthy version release cadence and project activity, because the last version was released a year ago. It has 6 open source maintainers collaborating on the project.