A tiny HTML5 lexer
A tiny standards-compliant HTML5 lexer/chunker.
The minified bundle is currently 6.3 KB.
Its small size should make it ideal for client-side usage.
The chunker preserves all input characters, so it is also suitable as a base for a syntax highlighter or an HTML editor, if you like.
It is lazy/on-demand, so it does not buffer chunks unnecessarily.
You can see a simple example of it running in the browser here.
I would love for someone to build a tiny template language with it.
Feel free to contact me with any questions.
API
Two top-level generator functions: `chunks (input)` and `tags (input)`.
```javascript
let tinyhtml = require ('tiny-html-lexer')
let stream = tinyhtml.chunks ('<span>Hello, world</span>')
for (let chunk of stream)
  console.log (chunk)
```
Likewise:
```javascript
let stream = tinyhtml.tags ('<span>Hello, world</span>')
for (let tag of stream)
  console.log (tag)
```
You can access the lexer state as follows (this part of the API may still change a bit):
```javascript
let stream = tinyhtml.chunks ('<span>Hello, world</span>')
console.log (stream.state)
for (let chunk of stream) {
  console.log (chunk)
  console.log (stream.state)
}
```
Chunks
Chunks are tuples (arrays) `[type, data]`, where `type` is a string and `data` is a chunk of the input string (a small round-trip example follows the list below). `type` is one of:
"attributeName"
"attributeAssign"
"attributeValueStart"
"attributeValueData"
"attributeValueEnd"
"tagSpace"
"commentStart"
"commentStartBogus"
"commentData"
"commentEnd"
"commentEndBogus"
"startTagStart"
"endTagStart"
"tagEnd"
"tagEndClose"
"charRefDecimal"
"charRefHex"
"charRefNamed"
"unescaped"
"data"
"newline"
"rcdata"
"rawtext"
"plaintext"
Tags
These are called 'tokens' in the HTML5 standard. A 'tag' is either a plain string, or an object that is an instance of `StartTag`, `EndTag` or `Comment`.
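For example, one might distinguish the two cases as follows (a sketch; the properties of the tag objects are not documented here, so only the string/object distinction from the description above is relied upon):

```javascript
let tinyhtml = require ('tiny-html-lexer')

for (let tag of tinyhtml.tags ('<span id="a">Hello, world</span>')) {
  if (typeof tag === 'string')
    console.log ('data:', tag)  // plain data string between tags
  else
    console.log ('tag:', tag)   // a StartTag, EndTag or Comment instance
}
```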
Limitations
- Doctype declarations are preserved, but they are parsed as bogus comments rather than as doctype tokens.
- CDATA sections (only used in SVG/foreign content) are likewise parsed as bogus comments.
- Only a very limited number of named character references is supported by the token builder (i.e. the `tags` parser), mostly because naively adding a map of entities would increase the code size roughly tenfold. I am still thinking about a way to compress them. (Feel free to contact me if you need this.)
Changelog
0.9.1
- The token builder now lowercases attribute names and handles duplicate attributes according to the standard (the first value is preserved).
- Some preliminary work has been done to emit newlines as separate `"newline"` chunks.
0.9.0
- Rewrote the lexer runtime.
- Added a token builder! Use `tinyhtml.tags (string)` to get a lazy stream (an iterator) of tag objects and data strings.
- Disabled the TypeScript annotations for the time being.
- The types have been renamed to use camelCase.
0.8.5
- Fix an issue introduced in version 0.8.4 where terminating semicolons after legacy character references would be tokenized as data.
0.8.4
- Correct handling of legacy (unterminated) named character references.
0.8.3
- Added TypeScript annotations.
- Token type `attribute-equals` has been renamed to `attribute-assign`.
- Renamed export `tokens` to `tokenTypes`.
0.8.1
- Fix for incorrect parsing of slashes between attributes.
0.8.0
Some implementation details
The idea is that the lexical grammar can be very compactly expressed by
a state machine that has transitions labeled with regular expressions
rather than individual characters.
I am using regular expressions without capture groups for the transitions.
For each state, each outgoing transition is wrapped in parentheses to
create a capture group, and the results are joined together as alternates
into a single regular expression per state. When this regular expression
is executed, one can check which transition was taken by checking which
index in the result of `regex.exec` is present.
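As an illustration only, here is a minimal sketch of that technique, with two made-up transitions out of a single toy state (the patterns and state names are hypothetical, not the lexer's actual tables):

```javascript
// Toy transitions out of a single 'data' state. Each pattern is wrapped
// in a capture group and all are joined as alternates into one sticky regex.
let transitions = [
  { pattern: '(<[a-zA-Z][^\\s>]*)', next: 'tagName' }, // start-tag open (hypothetical)
  { pattern: '([^<]+)',             next: 'data'    }, // character data (hypothetical)
]
let combined = new RegExp (transitions.map (t => t.pattern).join ('|'), 'y')

function step (input, position) {
  combined.lastIndex = position
  let result = combined.exec (input)
  if (result == null) return null
  // The index of the defined capture group identifies the transition taken.
  let index = result.findIndex ((group, i) => i > 0 && group !== undefined)
  return { chunk: result[index], next: transitions[index - 1].next, end: combined.lastIndex }
}

console.log (step ('<span>Hello', 0))
// => { chunk: '<span', next: 'tagName', end: 5 }
```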
License
MIT.
Enjoy!