
Security News
CISA Kills Off RSS Feeds for KEVs and Cyber Alerts
CISA is discontinuing official RSS support for KEV and cybersecurity alerts, shifting updates to email and social media, disrupting automation workflows.
Straightforward HTML parser for JavaScript. Live Demo.
npm install hyntax
const { tokenize, constructTree } = require('hyntax')
const util = require('util')
const inputHTML = `
<html>
<body>
<input type="text" placeholder="Don't type">
<button>Don't press</button>
</body>
</html>
`
const { tokens } = tokenize(inputHTML)
const { ast } = constructTree(tokens)
console.log(JSON.stringify(tokens, null, 2))
console.log(util.inspect(ast, { showHidden: false, depth: null }))
Hyntax is written in JavaScript but has integrated TypeScript typings to help you navigate around its data structures. There is also Types Reference which covers most common types.
Use StreamTokenizer
and StreamTreeConstructor
classes to parse HTML chunk by chunk while it's still being loaded from the network or read from the disk.
const { StreamTokenizer, StreamTreeConstructor } = require('hyntax')
const http = require('http')
const util = require('util')
http.get('http://info.cern.ch', (res) => {
const streamTokenizer = new StreamTokenizer()
const streamTreeConstructor = new StreamTreeConstructor()
let resultTokens = []
let resultAst
res.pipe(streamTokenizer).pipe(streamTreeConstructor)
streamTokenizer
.on('data', (tokens) => {
resultTokens = resultTokens.concat(tokens)
})
.on('end', () => {
console.log(JSON.stringify(resultTokens, null, 2))
})
streamTreeConstructor
.on('data', (ast) => {
resultAst = ast
})
.on('end', () => {
console.log(util.inspect(resultAst, { showHidden: false, depth: null }))
})
}).on('error', (err) => {
throw err;
})
Here are all kinds of tokens which Hyntax will extract out of HTML string.
Each token conforms to Tokenizer.Token interface.
Resulting syntax tree will have at least one top-level Document Node with optional children nodes nested within.
{
nodeType: TreeConstructor.NodeTypes.Document,
content: {
children: [
{
nodeType: TreeConstructor.NodeTypes.AnyNodeType,
content: {…}
},
{
nodeType: TreeConstructor.NodeTypes.AnyNodeType,
content: {…}
}
]
}
}
Content of each node is specific to node's type, all of them are described in AST Node Types reference.
Hyntax has its tokenizer as a separate module. You can use generated tokens on their own or pass them further to a tree constructor to build an AST.
tokenize(html: String): Tokenizer.Result
html
After you've got an array of tokens, you can pass them into tree constructor to build an AST.
constructTree(tokens: Tokenizer.AnyToken[]): TreeConstructor.Result
tokens
interface Result {
state: Tokenizer.State
tokens: Tokenizer.AnyToken[]
}
state
tokens
interface Result {
state: State
ast: AST
}
state
The current state of the tree constructor. Can be persisted and passed to the next tree constructor call in case when tokens are coming in chunks.
ast
Resulting AST.
Type: TreeConstructor.AST
Generic Token, other interfaces use it to create a specific Token type.
interface Token<T extends TokenTypes.AnyTokenType> {
type: T
content: string
startPosition: number
endPosition: number
}
type
One of the Token types.
content
Piece of original HTML string which was recognized as a token.
startPosition
Index of a character in the input HTML string where the token starts.
endPosition
Index of a character in the input HTML string where the token ends.
Shortcut type of all possible tokens.
type AnyTokenType =
| Text
| OpenTagStart
| AttributeKey
| AttributeAssigment
| AttributeValueWrapperStart
| AttributeValue
| AttributeValueWrapperEnd
| OpenTagEnd
| CloseTag
| OpenTagStartScript
| ScriptTagContent
| OpenTagEndScript
| CloseTagScript
| OpenTagStartStyle
| StyleTagContent
| OpenTagEndStyle
| CloseTagStyle
| DoctypeStart
| DoctypeEnd
| DoctypeAttributeWrapperStart
| DoctypeAttribute
| DoctypeAttributeWrapperEnd
| CommentStart
| CommentContent
| CommentEnd
Shortcut to reference any possible token.
type AnyToken = Token<TokenTypes.AnyTokenType>
Just an alias to DocumentNode. AST always has one top-level DocumentNode. See AST Node Types
type AST = TreeConstructor.DocumentNode
There are 7 possible types of Node. Each type has a specific content.
type DocumentNode = Node<NodeTypes.Document, NodeContents.Document>
type DoctypeNode = Node<NodeTypes.Doctype, NodeContents.Doctype>
type TextNode = Node<NodeTypes.Text, NodeContents.Text>
type TagNode = Node<NodeTypes.Tag, NodeContents.Tag>
type CommentNode = Node<NodeTypes.Comment, NodeContents.Comment>
type ScriptNode = Node<NodeTypes.Script, NodeContents.Script>
type StyleNode = Node<NodeTypes.Style, NodeContents.Style>
Interfaces for each content type:
Generic Node, other interfaces use it to create specific Nodes by providing type of Node and type of the content inside the Node.
interface Node<T extends NodeTypes.AnyNodeType, C extends NodeContents.AnyNodeContent> {
nodeType: T
content: C
}
Shortcut type of all possible Node types.
type AnyNodeType =
| Document
| Doctype
| Tag
| Text
| Comment
| Script
| Style
Shortcut type of all possible types of content inside a Node.
type AnyNodeContent =
| Document
| Doctype
| Text
| Tag
| Comment
| Script
| Style
interface Document {
children: AnyNode[]
}
interface Doctype {
start: Tokenizer.Token<Tokenizer.TokenTypes.DoctypeStart>
attributes?: DoctypeAttribute[]
end: Tokenizer.Token<Tokenizer.TokenTypes.DoctypeEnd>
}
interface Text {
value: Tokenizer.Token<Tokenizer.TokenTypes.Text>
}
interface Tag {
name: string
selfClosing: boolean
openStart: Tokenizer.Token<Tokenizer.TokenTypes.OpenTagStart>
attributes?: TagAttribute[]
openEnd: Tokenizer.Token<Tokenizer.TokenTypes.OpenTagEnd>
children?: AnyNode[]
close: Tokenizer.Token<Tokenizer.TokenTypes.CloseTag>
}
interface Comment {
start: Tokenizer.Token<Tokenizer.TokenTypes.CommentStart>
value: Tokenizer.Token<Tokenizer.TokenTypes.CommentContent>
end: Tokenizer.Token<Tokenizer.TokenTypes.CommentEnd>
}
interface Script {
openStart: Tokenizer.Token<Tokenizer.TokenTypes.OpenTagStartScript>
attributes?: TagAttribute[]
openEnd: Tokenizer.Token<Tokenizer.TokenTypes.OpenTagEndScript>
value: Tokenizer.Token<Tokenizer.TokenTypes.ScriptTagContent>
close: Tokenizer.Token<Tokenizer.TokenTypes.CloseTagScript>
}
interface Style {
openStart: Tokenizer.Token<Tokenizer.TokenTypes.OpenTagStartStyle>,
attributes?: TagAttribute[],
openEnd: Tokenizer.Token<Tokenizer.TokenTypes.OpenTagEndStyle>,
value: Tokenizer.Token<Tokenizer.TokenTypes.StyleTagContent>,
close: Tokenizer.Token<Tokenizer.TokenTypes.CloseTagStyle>
}
interface DoctypeAttribute {
startWrapper?: Tokenizer.Token<Tokenizer.TokenTypes.DoctypeAttributeWrapperStart>,
value: Tokenizer.Token<Tokenizer.TokenTypes.DoctypeAttribute>,
endWrapper?: Tokenizer.Token<Tokenizer.TokenTypes.DoctypeAttributeWrapperEnd>
}
interface TagAttribute {
key?: Tokenizer.Token<Tokenizer.TokenTypes.AttributeKey>,
startWrapper?: Tokenizer.Token<Tokenizer.TokenTypes.AttributeValueWrapperStart>,
value?: Tokenizer.Token<Tokenizer.TokenTypes.AttributeValue>,
endWrapper?: Tokenizer.Token<Tokenizer.TokenTypes.AttributeValueWrapperEnd>
}
FAQs
Straightforward HTML parser for Node.js and browser
The npm package hyntax-yx receives a total of 721 weekly downloads. As such, hyntax-yx popularity was classified as not popular.
We found that hyntax-yx demonstrated a not healthy version release cadence and project activity because the last version was released a year ago. It has 1 open source maintainer collaborating on the project.
Did you know?
Socket for GitHub automatically highlights issues in each pull request and monitors the health of all your open source dependencies. Discover the contents of your packages and block harmful activity before you install or update your dependencies.
Security News
CISA is discontinuing official RSS support for KEV and cybersecurity alerts, shifting updates to email and social media, disrupting automation workflows.
Security News
The MCP community is launching an official registry to standardize AI tool discovery and let agents dynamically find and install MCP servers.
Research
Security News
Socket uncovers an npm Trojan stealing crypto wallets and BullX credentials via obfuscated code and Telegram exfiltration.