Security News
Python Overtakes JavaScript as Top Programming Language on GitHub
Python becomes GitHub's top language in 2024, driven by AI and data science projects, while AI-powered security tools are gaining adoption.
The parse5 npm package is a fast full-featured specification-compliant HTML parser for Node.js. It allows users to parse HTML documents and manipulate the resulting document tree structure. The package provides a variety of modules for parsing, serializing, and tree adaptation based on the DOM (Document Object Model) interface.
Parsing HTML
This feature allows you to parse an HTML string into a document tree that can be manipulated or queried.
const parse5 = require('parse5');
const html = '<!DOCTYPE html><html><head></head><body>Hi there!</body></html>';
const document = parse5.parse(html);
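Once parsed, the tree is made of plain objects that can be walked directly. The following is a minimal sketch, not part of the parse5 API: `sampleDocument` is a hand-built stand-in for the value `parse5.parse` returns in its default tree format (only the fields used here are included), and `collectText` is a hypothetical helper.

```javascript
// Hand-built stand-in for a parsed document in the default tree format:
// elements carry `nodeName` and `childNodes`, text nodes carry
// `nodeName: '#text'` and `value`. Only the fields used below are included.
const sampleDocument = {
  nodeName: '#document',
  childNodes: [
    { nodeName: '#documentType', name: 'html' },
    {
      nodeName: 'html',
      childNodes: [
        { nodeName: 'head', childNodes: [] },
        {
          nodeName: 'body',
          childNodes: [{ nodeName: '#text', value: 'Hi there!' }],
        },
      ],
    },
  ],
};

// Depth-first walk that gathers the value of every text node.
function collectText(node) {
  if (node.nodeName === '#text') {
    return node.value;
  }
  // Doctype and comment nodes have no childNodes, so default to an empty list.
  return (node.childNodes || []).map(collectText).join('');
}

console.log(collectText(sampleDocument)); // prints "Hi there!"
```

The same function should work on a real parse result, since it only reads `nodeName`, `value`, and `childNodes`.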
Serializing Document
This feature enables you to serialize a document tree back into an HTML string.
const parse5 = require('parse5');
const document = parse5.parse('<!DOCTYPE html><html><head></head><body>Hi there!</body></html>');
const html = parse5.serialize(document);
Streaming
This feature allows you to use parse5 in a streaming mode, which is useful for processing large HTML documents without loading them entirely into memory.
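const parse5 = require('parse5');
const fs = require('fs');
const file = fs.createReadStream('example.html');
// parse5.SAXParser is exported by older parse5 releases; newer releases
// ship the SAX parser separately as the parse5-sax-parser package.
const parser = new parse5.SAXParser();
parser.on('text', (text) => {
  console.log(text);
});
file.pipe(parser);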
Tree Adapters
This feature allows you to use different tree adapters to interact with the parsed document tree in a way that is compatible with other libraries or your own custom requirements.
const parse5 = require('parse5');
const htmlparser2Adapter = require('parse5-htmlparser2-tree-adapter');
const html = '<!DOCTYPE html><html><head></head><body>Hi there!</body></html>';
const document = parse5.parse(html, { treeAdapter: htmlparser2Adapter });
Cheerio is a fast, flexible, and lean implementation of core jQuery designed specifically for the server. It uses a very similar syntax to jQuery, allowing for manipulation of the elements in the parsed document tree. Cheerio adds a selector-based query API on top of a parsed tree, which parse5 itself does not provide; whether its output follows the HTML5 specification depends on the parser it is configured to use.
jsdom is a pure-JavaScript implementation of many web standards, notably the WHATWG DOM and HTML Standards, for use with Node.js. It creates a DOM environment similar to that provided by web browsers, including a window object. jsdom is more feature-rich than parse5, providing a complete emulation of a web browser's environment, but it is also heavier and slower for simple parsing tasks.
htmlparser2 is a forgiving HTML and XML parser. It is fast and has a simple API, but unlike parse5, it does not strictly adhere to the HTML5 parsing specification. It is suitable for parsing non-standard or malformed HTML.
Fast full-featured HTML parsing/serialization toolset for Node. Based on WHATWG HTML5 specification.
To build TestCafé we needed a fast, production-ready HTML parser that parses HTML the same way a modern browser does.
Existing solutions were either too slow or produced output that was too inaccurate. That is how parse5 was born.
##Install
$ npm install parse5
##Simple usage
var Parser = require('parse5').Parser;
//Instantiate parser
var parser = new Parser();
//Then feed it with an HTML document
var document = parser.parse('<!DOCTYPE html><html><head></head><body>Hi there!</body></html>');
//Now let's parse HTML-snippet
var fragment = parser.parseFragment('<title>Parse5 is fucking awesome!</title><h1>42</h1>');
##Is it fast?
Check out this benchmark.
Starting benchmark. Fasten your seatbelts...
html5 (https://github.com/aredridel/html5) x 0.18 ops/sec ±5.92% (5 runs sampled)
htmlparser (https://github.com/tautologistics/node-htmlparser/) x 3.83 ops/sec ±42.43% (14 runs sampled)
htmlparser2 (https://github.com/fb55/htmlparser2) x 4.05 ops/sec ±39.27% (15 runs sampled)
parse5 (https://github.com/inikulin/parse5) x 3.04 ops/sec ±51.81% (13 runs sampled)
Fastest is htmlparser2 (https://github.com/fb55/htmlparser2),parse5 (https://github.com/inikulin/parse5)
So parse5 is about as fast as the simple parsers that do not follow the specification, and roughly 15 times (!) faster than the existing specification-compliant parser available for Node.
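The speedup can be checked against the mean throughput figures in the benchmark output above. This is a small sketch; `opsPerSec` and `speedup` are names introduced here for illustration, and the calculation ignores the large relative margins of error reported for each run.

```javascript
// Mean throughput (ops/sec) copied from the benchmark output above.
const opsPerSec = {
  html5: 0.18,
  htmlparser: 3.83,
  htmlparser2: 4.05,
  parse5: 3.04,
};

// How many times faster parse5 is than another parser, by mean ops/sec.
function speedup(other) {
  return opsPerSec.parse5 / opsPerSec[other];
}

console.log(speedup('html5').toFixed(1));       // prints "16.9"
console.log(speedup('htmlparser2').toFixed(2)); // prints "0.75"
```

By the means alone parse5 comes out about 17 times faster than html5, in the same range as the figure quoted above. Its mean is lower than that of htmlparser2, but with margins of 39% to 52% the benchmark reports the two as tied for fastest.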
##API reference
###Enum: TreeAdapters
Provides built-in tree adapters which can be passed as an optional argument to the Parser and TreeSerializer constructors.
####• TreeAdapters.default
Default tree format for parse5.
####• TreeAdapters.htmlparser2
Quite popular htmlparser2 tree format (e.g. used in cheerio and jsdom).
###Class: Parser
Provides HTML parsing functionality.
####• Parser.ctor([treeAdapter])
Creates a new reusable instance of the Parser. The optional treeAdapter argument specifies the resulting tree format. If the treeAdapter argument is not specified, the default tree adapter will be used.
Example:
var parse5 = require('parse5');
//Instantiate new parser with default tree adapter
var parser1 = new parse5.Parser();
//Instantiate new parser with htmlparser2 tree adapter
var parser2 = new parse5.Parser(parse5.TreeAdapters.htmlparser2);
####• Parser.parse(html)
Parses the specified html string. Returns the document node.
Example:
var document = parser.parse('<!DOCTYPE html><html><head></head><body>Hi there!</body></html>');
####• Parser.parseFragment(htmlFragment, [contextElement])
Parses the given htmlFragment. Returns a documentFragment node. The optional contextElement argument specifies the context in which the given htmlFragment will be parsed (think of it as setting the contextElement.innerHTML property). If the contextElement argument is not specified, a <div> element will be used.
Example:
var documentFragment = parser.parseFragment('<table></table>');
//Parse html fragment in context of the parsed <table> element
var trFragment = parser.parseFragment('<tr><td>Shake it, baby</td></tr>', documentFragment.childNodes[0]);
###Class: TreeSerializer
Provides tree-to-HTML serialization functionality.
####• TreeSerializer.ctor([treeAdapter])
Creates a new reusable instance of the TreeSerializer. The optional treeAdapter argument specifies the input tree format. If the treeAdapter argument is not specified, the default tree adapter will be used.
Example:
var parse5 = require('parse5');
//Instantiate new serializer with default tree adapter
var serializer1 = new parse5.TreeSerializer();
//Instantiate new serializer with htmlparser2 tree adapter
var serializer2 = new parse5.TreeSerializer(parse5.TreeAdapters.htmlparser2);
####• TreeSerializer.serialize(node)
Serializes the given node. Returns an HTML string.
Example:
var document = parser.parse('<!DOCTYPE html><html><head></head><body>Hi there!</body></html>');
//Serialize document
var html = serializer.serialize(document);
//Serialize <body> element content
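//document.childNodes[0] is the doctype node, so <html> is at index 1 and <body> is its second child
var bodyInnerHtml = serializer.serialize(document.childNodes[1].childNodes[1]);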
##Testing
Test data is adapted from the html5lib project. The parser is covered by more than 8000 test cases. To run the tests:
$ node test/run_tests.js
##Custom tree adapter
You can create a custom tree adapter so parse5 can work with your own DOM-tree implementation. Just pass your adapter implementation to the parser's constructor as an argument:
var Parser = require('parse5').Parser;
var myTreeAdapter = {
//Adapter methods...
};
//Instantiate parser
var parser = new Parser(myTreeAdapter);
Sample implementation can be found here.
The custom tree adapter should implement all methods exposed via exports in the sample implementation.
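To give a feel for what such an adapter looks like, here is a partial sketch. The method names (`createDocument`, `createElement`, `appendChild`, `insertText`) follow the parse5 tree adapter interface as far as I know it, but the node shape (`type`, `children`, `parent`) is a hypothetical custom DOM invented for this example, and a real adapter must implement every method of the sample implementation, not only these four.

```javascript
// Partial, illustrative adapter: a real one must implement every method
// exported by the sample implementation, not just these four.
const myTreeAdapter = {
  createDocument() {
    return { type: 'document', children: [] };
  },

  createElement(tagName, namespaceURI, attrs) {
    return { type: 'element', tagName, namespaceURI, attrs, children: [], parent: null };
  },

  appendChild(parentNode, newNode) {
    parentNode.children.push(newNode);
    newNode.parent = parentNode;
  },

  // Merge adjacent text into one node instead of creating many small ones.
  insertText(parentNode, text) {
    const last = parentNode.children[parentNode.children.length - 1];
    if (last && last.type === 'text') {
      last.value += text;
    } else {
      myTreeAdapter.appendChild(parentNode, { type: 'text', value: text, parent: null });
    }
  },
};

// Drive the adapter by hand, roughly as a parser would while building a tree.
const doc = myTreeAdapter.createDocument();
const body = myTreeAdapter.createElement('body', 'http://www.w3.org/1999/xhtml', []);
myTreeAdapter.appendChild(doc, body);
myTreeAdapter.insertText(body, 'Hi ');
myTreeAdapter.insertText(body, 'there!');

console.log(body.children.length);   // prints 1
console.log(body.children[0].value); // prints "Hi there!"
```

Because the parser only touches the tree through the adapter, the node shape is entirely up to the adapter author.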
##Questions or suggestions?
If you have any questions, please feel free to create an issue here on GitHub.
##Author
Ivan Nikulin (ifaaan@gmail.com)
FAQs
HTML parser and serializer.
The npm package parse5 receives a total of 42,984,211 weekly downloads. As such, parse5 is classified as a popular package.
We found that parse5 demonstrated a healthy version release cadence and project activity because the last version was released less than a year ago. It has 5 open source maintainers collaborating on the project.