A tiny HTML5 lexer
A tiny standards-compliant HTML5 lexer/chunker.
The minified bundle is currently 6.3 KB.
Its small size should make it ideal for client-side usage.
The chunker preserves all input characters, so it is also suitable as a base for a syntax highlighter or an HTML editor, if you like.
It is lazy/on-demand, so it does not buffer chunks unnecessarily.
You can see a simple example of it running in the browser here.
I would love for someone to build a tiny template language with it.
Feel free to contact me with any questions.
API
Two top-level generator functions: `chunks (input)` and `tags (input)`.
```javascript
let tinyhtml = require ('tiny-html-lexer')
let stream = tinyhtml.chunks ('<span>Hello, world</span>')
for (let chunk of stream)
  console.log (chunk)
```
Likewise:
```javascript
let stream = tinyhtml.tags ('<span>Hello, world</span>')
for (let tag of stream)
  console.log (tag)
```
You can access the lexer state as follows (this part of the API may still change a bit):
```javascript
let stream = tinyhtml.chunks ('<span>Hello, world</span>')
console.log (stream.state)
for (let chunk of stream) {
  console.log (chunk)
  console.log (stream.state)
}
```
Chunks
Chunks are tuples (arrays) `[type, data]`, where `type` is a string and `data` is a chunk of the input string (a small round-trip example follows the list below). `type` is one of:
"attributeName"
"attributeAssign"
"attributeValueStart"
"attributeValueData"
"attributeValueEnd"
"tagSpace"
"commentStart"
"commentStartBogus"
"commentData"
"commentEnd"
"commentEndBogus"
"startTagStart"
"endTagStart"
"tagEnd"
"tagEndClose"
"charRefDecimal"
"charRefHex"
"charRefNamed"
"unescaped"
"data"
"newline"
"rcdata"
"rawtext"
"plaintext"
Tags
These are called 'tokens' in the HTML5 standard. A 'tag' is either a plain string, or an object that is an instance of `StartTag`, `EndTag` or `Comment`.
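For example, one might distinguish the two cases as follows (a sketch; the properties of the tag objects are not documented here, so only the string/object distinction from the description above is relied upon):

```javascript
let tinyhtml = require ('tiny-html-lexer')

for (let tag of tinyhtml.tags ('<span id="a">Hello, world</span>')) {
  if (typeof tag === 'string')
    console.log ('data:', tag)  // plain data string between tags
  else
    console.log ('tag:', tag)   // a StartTag, EndTag or Comment instance
}
```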
Limitations
- Doctype declarations are preserved, but they are parsed as bogus comments rather than as doctype tokens.
- CDATA sections (only used in SVG/foreign content) are likewise parsed as bogus comments.
- Only a very limited number of named character references is supported by the token builder (i.e. the `tags` parser), mostly because naively adding a map of entities would increase the code size roughly tenfold. I am still thinking about a way to compress them. (Feel free to contact me if you need this.)
Changelog
0.9.1
- The token builder now lowercases attribute names and handles duplicate attributes according to the standard (the first value is preserved).
- Some preliminary work has been done to emit newlines as separate `"newline"` chunks.
0.9.0
- Rewrote the lexer runtime.
- Added a token builder! Use `tinyhtml.tags (string)` to get a lazy stream (an iterator) of tag objects and data strings.
- Disabled the TypeScript annotations for the time being.
- The types have been renamed to use camelCase.
0.8.5
- Fix an issue introduced in version 0.8.4 where terminating semicolons after legacy character references would be tokenized as data.
0.8.4
- Correct handling of legacy (unterminated) named character references.
0.8.3
- Added TypeScript annotations.
- Token type `attribute-equals` has been renamed to `attribute-assign`.
- Renamed export `tokens` to `tokenTypes`.
0.8.1
- Fix for incorrect parsing of slashes between attributes.
0.8.0
Some implementation details
The idea is that the lexical grammar can be very compactly expressed by
a state machine that has transitions labeled with regular expressions
rather than individual characters.
I am using regular expressions without capture groups for the transitions.
For each state, each outgoing transition is wrapped in parentheses to
create a capture group, and the results are joined together as alternates
into a single regular expression per state. When this regular expression
is executed, one can check which transition was taken by checking which
index in the result of `regex.exec` is present.
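As an illustration only, here is a minimal sketch of that technique, with two made-up transitions out of a single toy state (the patterns and state names are hypothetical, not the lexer's actual tables):

```javascript
// Toy transitions out of a single 'data' state. Each pattern is wrapped
// in a capture group and all are joined as alternates into one sticky regex.
let transitions = [
  { pattern: '(<[a-zA-Z][^\\s>]*)', next: 'tagName' }, // start-tag open (hypothetical)
  { pattern: '([^<]+)',             next: 'data'    }, // character data (hypothetical)
]
let combined = new RegExp (transitions.map (t => t.pattern).join ('|'), 'y')

function step (input, position) {
  combined.lastIndex = position
  let result = combined.exec (input)
  if (result == null) return null
  // The index of the defined capture group identifies the transition taken.
  let index = result.findIndex ((group, i) => i > 0 && group !== undefined)
  return { chunk: result[index], next: transitions[index - 1].next, end: combined.lastIndex }
}

console.log (step ('<span>Hello', 0))
// => { chunk: '<span', next: 'tagName', end: 5 }
```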
License
MIT.
Enjoy!