What is simple-html-tokenizer?
The simple-html-tokenizer npm package is a lightweight library designed to tokenize HTML strings. It breaks down HTML content into a stream of tokens, which can be useful for parsing, analyzing, or transforming HTML documents.
What are simple-html-tokenizer's main functionalities?
Tokenizing HTML
The `tokenize` function takes an HTML string as input and returns an array of tokens, each representing one part of the HTML, such as a start tag, an end tag, or a run of text.
const { tokenize } = require('simple-html-tokenizer');
const html = '<div>Hello, <span>world!</span></div>';
const tokens = tokenize(html);
console.log(tokens);
Handling different token types
Each token carries a `type`, such as 'StartTag', 'EndTag', or 'Chars'. The sample below branches on that type and processes each kind of token accordingly.
const { tokenize } = require('simple-html-tokenizer');
const html = '<div>Hello, <span>world!</span></div>';
const tokens = tokenize(html);
tokens.forEach(token => {
  switch (token.type) {
    case 'StartTag':
      console.log('Start tag:', token.tagName);
      break;
    case 'EndTag':
      console.log('End tag:', token.tagName);
      break;
    case 'Chars':
      console.log('Text:', token.chars);
      break;
    default:
      console.log('Other token:', token);
  }
});
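The same branching can be used to fold a token array into a summary. The sketch below runs on a hand-written `sampleTokens` array that assumes only the `type`, `tagName`, and `chars` fields used above; `summarizeTokens` is a hypothetical helper, not part of the package, and real tokenizer output may carry more fields.

```javascript
// Hand-written sample mimicking the token shape used in the switch above.
// Assumption: real tokenizer output may include additional fields.
const sampleTokens = [
  { type: 'StartTag', tagName: 'div' },
  { type: 'Chars', chars: 'Hello, ' },
  { type: 'StartTag', tagName: 'span' },
  { type: 'Chars', chars: 'world!' },
  { type: 'EndTag', tagName: 'span' },
  { type: 'EndTag', tagName: 'div' },
];

// Hypothetical helper: collect the text content and the names of opened tags.
function summarizeTokens(tokens) {
  let text = '';
  const openedTags = [];
  for (const token of tokens) {
    if (token.type === 'Chars') {
      text += token.chars;
    } else if (token.type === 'StartTag') {
      openedTags.push(token.tagName);
    }
    // EndTag and any other token types are ignored in this sketch.
  }
  return { text, openedTags };
}

const summary = summarizeTokens(sampleTokens);
console.log(summary.text);       // Hello, world!
console.log(summary.openedTags); // [ 'div', 'span' ]
```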
Other packages similar to simple-html-tokenizer
htmlparser2
htmlparser2 is a fast and forgiving HTML/XML parser. It is more feature-rich compared to simple-html-tokenizer, offering a complete DOM structure and event-based parsing, which makes it suitable for more complex parsing tasks.
parse5
parse5 is a highly compliant HTML parser that produces a DOM tree. It is designed to be fully compatible with the HTML5 specification, making it more robust for handling modern web content compared to simple-html-tokenizer.
html-tokenize
html-tokenize is another library for tokenizing HTML. It covers similar ground to simple-html-tokenizer but exposes a stream-based API rather than a single function call, so its performance characteristics may differ.
Simple HTML Tokenizer
Simple HTML Tokenizer is a lightweight JavaScript library that can be
used to tokenize the kind of HTML normally found in templates. It can be
used to preprocess templates to change the behavior of some template
element depending upon whether the template element was found in an
attribute or text.
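As a rough illustration of that use case, the sketch below walks a token array and records whether each template placeholder appears in an attribute or in text. Everything here is an assumption made for the example: the `{{name}}` placeholder syntax, the hand-written `templateTokens` array, the layout of `attributes` as arrays whose first two entries are name and value, and the hypothetical helper `findPlaceholders`.

```javascript
// Hand-written tokens for: <a href="{{url}}">{{label}}</a>
// Assumption: each attribute is an array whose first two entries are the
// name and the value; verify against real tokenizer output before relying on it.
const templateTokens = [
  { type: 'StartTag', tagName: 'a', attributes: [['href', '{{url}}']], selfClosing: false },
  { type: 'Chars', chars: '{{label}}' },
  { type: 'EndTag', tagName: 'a' },
];

// Assumed placeholder syntax: {{name}} with optional inner whitespace.
const PLACEHOLDER = /\{\{\s*([\w.]+)\s*\}\}/g;

// Hypothetical helper: report each placeholder together with its context.
function findPlaceholders(tokens) {
  const found = [];
  for (const token of tokens) {
    if (token.type === 'StartTag') {
      for (const [attrName, attrValue] of token.attributes || []) {
        for (const match of String(attrValue).matchAll(PLACEHOLDER)) {
          found.push({ name: match[1], context: 'attribute', attribute: attrName });
        }
      }
    } else if (token.type === 'Chars') {
      for (const match of token.chars.matchAll(PLACEHOLDER)) {
        found.push({ name: match[1], context: 'text' });
      }
    }
  }
  return found;
}

const placeholders = findPlaceholders(templateTokens);
console.log(placeholders);
```

A preprocessor could then, for example, escape attribute-context placeholders differently from text-context ones.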
It is not a full HTML5 tokenizer. It focuses on the kind of HTML that is
used in templates: content designed to be inserted into the <body>
and without <script> tags.
In particular, Simple HTML Tokenizer does not handle many states from
the HTML5 Tokenizer Specification:
- Any states involving CDATA or RCDATA
- Any states involving <script>
- Any states involving <DOCTYPE>
- The bogus comment state
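Given the limitations listed above, it can help to screen input before tokenizing it. The following is a minimal sketch of such a pre-check, not something the library provides: `findUnsupportedConstructs` is a hypothetical helper, and its regular expressions are rough heuristics rather than a parser. Treating `<textarea>` and `<title>` as the RCDATA elements follows the HTML specification. Bogus comments are not detected here.

```javascript
// Heuristic patterns for constructs the list above says are not handled.
// Assumption: a simple textual match is good enough for a warning.
const UNSUPPORTED = [
  { label: 'script', pattern: /<script[\s>]/i },
  { label: 'doctype', pattern: /<!doctype/i },
  { label: 'cdata', pattern: /<!\[CDATA\[/ },
  // RCDATA content appears inside <textarea> and <title> elements.
  { label: 'rcdata', pattern: /<(?:textarea|title)[\s>]/i },
];

// Hypothetical helper: return the labels of any unsupported constructs found.
function findUnsupportedConstructs(html) {
  return UNSUPPORTED
    .filter(({ pattern }) => pattern.test(html))
    .map(({ label }) => label);
}

console.log(findUnsupportedConstructs('<div>ok</div>'));
// []
console.log(findUnsupportedConstructs('<!DOCTYPE html><script>x</script>'));
// [ 'script', 'doctype' ]
```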
It also passes through character references, instead of trying to
tokenize and process them, because the preprocessed templates will
ultimately be parsed by a real browser context.
At the moment, there are some error states specified by the tokenizer
spec that are not handled by Simple HTML Tokenizer. Ultimately, I plan
to support all error states, as well as provide information about
tokenizer errors in debug mode.
Usage
You can tokenize HTML:
var tokens = HTML5Tokenizer.tokenize("<div id='foo' href=bar class=\"bat\">");
var token = tokens[0];
token.tagName      // "div"
token.attributes   // the tag's id, href, and class attributes
token.selfClosing  // false
Building and running the tests
npm install
npm test