tiny-html-lexer
A tiny standards-compliant HTML5 lexer and tokeniser. The minified bundle is currently 6.9k bytes. Its small size should make it ideal for client-side usage.
The chunker preserves all input characters, so it is also suitable for building a syntax highlighter or an HTML editor on top of it. It is lazy (on demand), so it does not unnecessarily buffer chunks. You can see a simple example of it running in the browser here.
I would love for someone to build a tiny template language with it. Feel free to contact me with any questions.
The tiny-html-lexer module exposes two top level generator functions:
chunks, a.k.a. lexemes:
let tinyhtml = require ('tiny-html-lexer')
let stream = tinyhtml.chunks ('<span>Hello, world</span>')
for (let chunk of stream)
  console.log (chunk)
Likewise, tags, a.k.a. tokens:
let stream = tinyhtml.tags ('<span>Hello, world</span>')
for (let token of stream)
  console.log (token)
⚠️ Only a very limited number of named character references are supported by the token builder (i.e. the tags parser), mainly because naively adding a map of all entities would increase the code size about tenfold, so I am thinking about a way to compress them.
However, you can supply your own decoder to the tags function by passing an options argument as follows:
function parseNamedCharRef (string) {
  return string in myEntityMap ? myEntityMap [string] : string
}

let stream = tinyhtml.tags ('<span>Hello, world</span>', { parseNamedCharRef })
for (let token of stream)
  console.log (token)
Note that the input string is not always a known HTML named character reference. It does always start with &amp; and typically includes the terminating ; character. However, the semicolon is not appended to non-terminated legacy named character references found in the HTML source.
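For illustration, a self-contained decoder along these lines might look like the following. The entities map here is a made-up stand-in (a complete one could come from a package such as entities on npm); it covers both terminated and legacy non-terminated forms:

```javascript
// Illustrative entity map, not part of tiny-html-lexer.
const entities = {
  '&amp;': '&',  // terminated reference
  '&amp': '&',   // legacy (non-terminated) reference
  '&lt;': '<',
  '&gt;': '>'
}

function parseNamedCharRef (string) {
  // Return the input unchanged when the reference is unknown.
  return string in entities ? entities[string] : string
}
```

Unknown references fall through unchanged, which matches the lexer's behaviour of preserving all input characters.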
You can access the chunks lexer state as follows:
let stream = tinyhtml.chunks ('<span>Hello, world</span>')
console.log (stream.state) // state before
for (let chunk of stream) {
console.log (chunk)
console.log (stream.state) // state after last seen chunk
}
This is similar for tags, as follows. Note that this returns the state of the underlying chunks lexer.
let stream = tinyhtml.tags ('<span>Hello, world</span>')
console.log (stream.state) // lexer state before
for (let token of stream) {
  console.log (token)
  console.log (stream.state) // lexer state after last seen token
}
Chunks are produced by the chunks generator function. A chunk is a pair, i.e. an array [type, data] where type is a string and data is a chunk of the input string.
The type is one of:
"attributeName"
"attributeAssign"
"attributeValueStart"
"attributeValueData"
"attributeValueEnd"
"tagSpace"
"commentStart"
"commentStartBogus"
"commentData"
"commentEnd"
"commentEndBogus"
"startTagStart"
"endTagStart"
"tagEnd"
"tagEndClose"
"charRefDecimal"
"charRefHex"
"charRefNamed"
"unescaped"
"data"
"newline"
"rcdata"
"rawtext"
"plaintext"
The generator returned from the chunks function has a property state that provides access to the lexer state. This can be used to annotate chunks with source positions if needed.
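Because every chunk is a [type, data] pair that preserves all input characters, and newlines arrive as separate "newline" chunks, source positions can also be tracked without inspecting the lexer state at all. A minimal sketch, using a hypothetical annotate helper that works on any iterable of such pairs:

```javascript
// Hypothetical helper (not part of tiny-html-lexer): annotate each
// [type, data] chunk with the line and column at which it starts.
function* annotate (chunks) {
  let line = 1, col = 0
  for (const [type, data] of chunks) {
    yield { type, data, line, col }
    if (type === 'newline') { line += 1; col = 0 }
    else col += data.length
  }
}
```

Passing tinyhtml.chunks (input) to annotate would then yield position-tagged chunks.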
The word token has a specific meaning in the HTML5 standard.
Tokens are more abstract than chunks.
A 'Token' is a plain string, or an object that is an instance of
StartTag, EndTag, Whitespace, Comment or BogusComment.
Changelog:
- Added tinyhtml.tags (string) to get a lazy stream (an iterator) of tag objects and data strings.
- Added "newline" chunks.
- attribute-equals has been renamed to attribute-assign.
- tokens has been renamed to tokenTypes.
The idea is that the lexical grammar can be very compactly expressed by a state machine that has transitions labeled with regular expressions rather than individual characters.
I am using regular expressions without capture groups for the transitions. For each state, the outgoing transitions are then wrapped in parentheses to create a capture group and then are all joined together as alternates in a single regular expression per state. When this regular expression is executed, one can then check which transition was taken by checking which index in the result of regex.exec is present.
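A minimal sketch of that technique, reduced to a single state with three outgoing transitions (this is an illustration, not tiny-html-lexer's actual code):

```javascript
// Each transition is a capture-group-free regular expression.
const transitions = ['[a-zA-Z]+', '[0-9]+', '\\s+']
const labels = ['word', 'number', 'space']

// Wrap each transition in a capture group and join them as alternates;
// the sticky flag makes exec match exactly at lastIndex.
const state = new RegExp(transitions.map(r => '(' + r + ')').join('|'), 'y')

function* lex (input) {
  state.lastIndex = 0
  let match
  while ((match = state.exec(input))) {
    // The index of the defined capture group identifies the transition taken.
    const which = match.findIndex((g, i) => i > 0 && g !== undefined)
    yield [labels[which - 1], match[0]]
  }
}
```

For input 'ab 12' this yields the pairs ['word', 'ab'], ['space', ' '] and ['number', '12']; a real lexer would keep one such combined regular expression per state and switch states based on the transition taken.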
License: MIT.
Enjoy!