A fast, production-ready HTML parsing and serialization toolset for Node, compliant with the WHATWG HTML5 specification.
To build TestCafé, we needed a fast, production-ready HTML parser that parses HTML the way a modern browser's parser does.
Existing solutions were either too slow or too inaccurate, so parse5 was born.
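Included tools:
- Parser - HTML parser that produces a document tree.
- SimpleApiParser - SAX-style HTML parser.
- Serializer - tree-to-HTML serializer.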
##Install
$ npm install parse5
##Usage
var Parser = require('parse5').Parser;
var parser = new Parser();
var document = parser.parse('<!DOCTYPE html><html><head></head><body>Hi there!</body></html>');
var fragment = parser.parseFragment('<title>Parse5 is fucking awesome!</title><h1>42</h1>');
##Is it fast?
Check out this benchmark.
Starting benchmark. Fasten your seatbelts...
html5 (https://github.com/aredridel/html5) x 0.18 ops/sec ±5.92% (5 runs sampled)
htmlparser (https://github.com/tautologistics/node-htmlparser/) x 3.83 ops/sec ±42.43% (14 runs sampled)
htmlparser2 (https://github.com/fb55/htmlparser2) x 4.05 ops/sec ±39.27% (15 runs sampled)
parse5 (https://github.com/inikulin/parse5) x 3.04 ops/sec ±51.81% (13 runs sampled)
Fastest is htmlparser2 (https://github.com/fb55/htmlparser2),parse5 (https://github.com/inikulin/parse5)
So parse5 is about as fast as simple, specification-incompatible parsers and ~15 times(!) faster than the current specification-compatible parser available for Node.
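To make the comparison concrete, here is a small sketch that computes the ratios from the ops/sec figures in the log above. The `opsPerSec` object and the `speedup` helper are illustrative names, not part of parse5, and the numbers are just the ones from this single run.

```javascript
// ops/sec values copied from the benchmark output above
var opsPerSec = {
    html5: 0.18,
    htmlparser: 3.83,
    htmlparser2: 4.05,
    parse5: 3.04
};

// Hypothetical helper: how many times faster `a` is than `b`
function speedup(a, b) {
    return opsPerSec[a] / opsPerSec[b];
}

console.log(speedup('parse5', 'html5').toFixed(1));       // 16.9
console.log(speedup('parse5', 'htmlparser2').toFixed(2)); // 0.75
console.log(speedup('parse5', 'htmlparser').toFixed(2));  // 0.79
```

The wide error margins in the log (roughly ±40-50% for the three fastest parsers) mean the gap to htmlparser and htmlparser2 should not be read too precisely, which is why the benchmark reports htmlparser2 and parse5 together as fastest.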
##API reference
###Enum: TreeAdapters
Provides built-in tree adapters which can be passed as an optional argument to the Parser and Serializer constructors.
####• TreeAdapters.default
Default tree format for parse5.
####• TreeAdapters.htmlparser2
The popular htmlparser2 tree format (used, for example, in cheerio and jsdom).
###Class: Parser
Provides HTML parsing functionality.
####• Parser.ctor([treeAdapter])
Creates a new reusable instance of the Parser. The optional treeAdapter argument specifies the resulting tree format. If treeAdapter is not specified, the default tree adapter is used.
Example:
var parse5 = require('parse5');
var parser1 = new parse5.Parser();
var parser2 = new parse5.Parser(parse5.TreeAdapters.htmlparser2);
####• Parser.parse(html)
Parses the specified html string. Returns the document node.
Example:
var document = parser.parse('<!DOCTYPE html><html><head></head><body>Hi there!</body></html>');
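Once you have a tree, you will usually want to walk it. The sketch below collects the text content of a node. It runs on a hand-built object that stands in for a small parsed subtree, so it does not need parse5 installed. The node shape used here (`nodeName`, `childNodes`, and `value` on `#text` nodes) is an assumption about the default tree format; check it against the tree your adapter actually produces. `collectText` is a hypothetical helper, not a parse5 function.

```javascript
// Hypothetical helper: concatenates the text of a node and its descendants.
// Assumes text nodes look like { nodeName: '#text', value: '...' }.
function collectText(node) {
    if (node.nodeName === '#text')
        return node.value;

    var text = '';
    var children = node.childNodes || [];

    for (var i = 0; i < children.length; i++)
        text += collectText(children[i]);

    return text;
}

// Hand-built stand-in for the subtree of <body>Hi <b>there!</b></body>
var body = {
    nodeName: 'body',
    childNodes: [
        { nodeName: '#text', value: 'Hi ' },
        { nodeName: 'b', childNodes: [{ nodeName: '#text', value: 'there!' }] }
    ]
};

console.log(collectText(body)); // Hi there!
```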
####• Parser.parseFragment(htmlFragment, [contextElement])
Parses the given htmlFragment. Returns a documentFragment node. The optional contextElement argument specifies the context in which the given htmlFragment will be parsed (think of it as setting the contextElement.innerHTML property). If contextElement is not specified, a <template> element is used as the context and the fragment is parsed in a 'forgiving' manner.
Example:
var documentFragment = parser.parseFragment('<table></table>');
var trFragment = parser.parseFragment('<tr><td>Shake it, baby</td></tr>', documentFragment.childNodes[0]);
###Class: SimpleApiParser
Provides SAX-style HTML parsing functionality.
####• SimpleApiParser.ctor(handlers)
Creates a new reusable instance of the SimpleApiParser. The handlers argument specifies an object that contains the parser's event handlers. The possible events and their signatures are shown in the example.
Example:
var parse5 = require('parse5');
var parser = new parse5.SimpleApiParser({
doctype: function(name, publicId, systemId) {
},
startTag: function(tagName, attrs, selfClosing) {
},
endTag: function(tagName) {
},
text: function(text) {
},
comment: function(text) {
}
});
####• SimpleApiParser.parse(html)
Raises parser events for the given html.
Example:
var parse5 = require('parse5');
var parser = new parse5.SimpleApiParser({
text: function(text) {
console.log(text);
}
});
parser.parse('<body>Yo!</body>');
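As a slightly larger illustration of the handler style, here is a sketch of a handlers object that collects link targets. `createLinkCollector` is a hypothetical helper, and the shape of `attrs` (an array of `{ name, value }` objects) is an assumption you should verify against your parse5 version. To keep the sketch runnable on its own, the events are raised by hand instead of by the parser.

```javascript
// Hypothetical helper: builds handlers that record every <a href="..."> value.
// Assumption: attrs is an array of { name: ..., value: ... } objects.
function createLinkCollector() {
    var links = [];

    return {
        links: links,
        handlers: {
            startTag: function (tagName, attrs, selfClosing) {
                if (tagName !== 'a')
                    return;

                for (var i = 0; i < attrs.length; i++) {
                    if (attrs[i].name === 'href')
                        links.push(attrs[i].value);
                }
            }
        }
    };
}

var collector = createLinkCollector();

// With parse5 installed, the handlers would be wired up like this:
//   var parser = new parse5.SimpleApiParser(collector.handlers);
//   parser.parse('<a href="/docs">Docs</a><img src="logo.png">');
// Here the same events are simulated by calling the handler directly.
collector.handlers.startTag('a', [{ name: 'href', value: '/docs' }], false);
collector.handlers.startTag('img', [{ name: 'src', value: 'logo.png' }], true);

console.log(collector.links); // [ '/docs' ]
```

Keeping the collected state in a closure, rather than in module-level variables, lets you create a fresh collector per document.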
###Class: Serializer
Provides tree-to-HTML serialization functionality.
Note: prior to v1.2.0 this class was called TreeSerializer. It is still accessible as parse5.TreeSerializer for backward compatibility.
####• Serializer.ctor([treeAdapter, options])
Creates a new reusable instance of the Serializer. The optional treeAdapter argument specifies the input tree format. If treeAdapter is not specified, the default tree adapter is used.
The options object modifies the serialization algorithm. (Warning: changing the default options violates the HTML5 specification. However, it may be useful in some cases, e.g. markup instrumentation. Use it at your own risk.)
- options.encodeHtmlEntities - HTML-encode characters like <, >, &, etc. Default: true.
Example:
var parse5 = require('parse5');
var serializer1 = new parse5.Serializer();
var serializer2 = new parse5.Serializer(parse5.TreeAdapters.htmlparser2);
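To show what HTML-encoding means in practice, here is a standalone sketch of the kind of escaping the HTML serialization algorithm describes for text and attribute values. `escapeText` and `escapeAttribute` are hypothetical helpers written for this illustration; they are not parse5 API, and parse5's own implementation may differ in details.

```javascript
// Hypothetical helper: escapes a string for use as text content
function escapeText(str) {
    return str
        .replace(/&/g, '&amp;')       // must run first, before other entities are added
        .replace(/\u00a0/g, '&nbsp;')
        .replace(/</g, '&lt;')
        .replace(/>/g, '&gt;');
}

// Hypothetical helper: escapes a string for use inside a double-quoted attribute
function escapeAttribute(str) {
    return str
        .replace(/&/g, '&amp;')
        .replace(/\u00a0/g, '&nbsp;')
        .replace(/"/g, '&quot;');
}

console.log(escapeText('a < b & c'));      // a &lt; b &amp; c
console.log(escapeAttribute('say "hi"'));  // say &quot;hi&quot;
```

With encodeHtmlEntities set to false, for example via `new parse5.Serializer(parse5.TreeAdapters.default, { encodeHtmlEntities: false })`, this kind of escaping is skipped, so the output may no longer round-trip through a parser.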
####• Serializer.serialize(node)
Serializes the given node. Returns an HTML string.
Example:
var document = parser.parse('<!DOCTYPE html><html><head></head><body>Hi there!</body></html>');
var html = serializer.serialize(document);
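// document.childNodes[0] is the doctype node, so <html> is childNodes[1] and <body> is its second child
var bodyInnerHtml = serializer.serialize(document.childNodes[1].childNodes[1]);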
##Testing
Test data is adapted from the html5lib project. The parser is covered by more than 8000 test cases.
To run tests:
$ npm test
##Custom tree adapter
You can create a custom tree adapter so parse5 can work with your own DOM-tree implementation.
Just pass your adapter implementation to the parser's constructor as an argument:
var Parser = require('parse5').Parser;
var myTreeAdapter = {
};
var parser = new Parser(myTreeAdapter);
A sample implementation can be found here.
A custom tree adapter should implement all methods exposed via exports in the sample implementation.
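To give a feel for what an adapter looks like, here is a partial sketch that stores nodes in a custom shape. The four method names and signatures below (`createDocument`, `createElement`, `appendChild`, `insertText`) are assumptions about what the sample implementation exports; verify them against the actual list. This subset is not enough to pass to the Parser, since a real adapter must implement every exported method. The calls at the bottom drive the adapter by hand, the way a parser would.

```javascript
// Partial, illustrative adapter: method names are assumed, not authoritative
var myTreeAdapter = {
    createDocument: function () {
        return { type: 'document', children: [] };
    },

    createElement: function (tagName, namespaceURI, attrs) {
        return { type: 'element', tag: tagName, ns: namespaceURI, attrs: attrs, children: [], parent: null };
    },

    appendChild: function (parentNode, newNode) {
        newNode.parent = parentNode;
        parentNode.children.push(newNode);
    },

    insertText: function (parentNode, text) {
        var last = parentNode.children[parentNode.children.length - 1];

        // Merge adjacent text so each run of characters is a single node
        if (last && last.type === 'text')
            last.value += text;
        else
            parentNode.children.push({ type: 'text', value: text, parent: parentNode });
    }
};

// Simulate the calls a parser might make for <body>Hi there!</body>
var doc = myTreeAdapter.createDocument();
var body = myTreeAdapter.createElement('body', 'http://www.w3.org/1999/xhtml', []);

myTreeAdapter.appendChild(doc, body);
myTreeAdapter.insertText(body, 'Hi ');
myTreeAdapter.insertText(body, 'there!');

console.log(body.children.length);   // 1
console.log(body.children[0].value); // Hi there!
```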
##Questions or suggestions?
If you have any questions, please feel free to create an issue here on GitHub.
##Author
Ivan Nikulin (ifaaan@gmail.com)