The htmlparser2 npm package is a fast and forgiving HTML and XML parser. It can parse HTML or XML into a DOM-like structure, which can then be manipulated or serialized. It is stream-based, which means it can handle large documents in a memory-efficient manner.

What are htmlparser2's main functionalities?

Parsing HTML to DOM

This feature lets you handle each part of a document as it is parsed, through callbacks rather than a prebuilt tree. The example code sets up event handlers for opening tags, text content, and closing tags, and then parses a simple HTML string.

const htmlparser2 = require('htmlparser2');
const parser = new htmlparser2.Parser({
  onopentag(name, attributes) {
    console.log(name, attributes);
  },
  ontext(text) {
    console.log(text);
  },
  onclosetag(tagname) {
    console.log(tagname);
  }
}, { decodeEntities: true });
parser.write('<div class="test">Hello World</div>');
parser.end();

Streaming Interface

This feature allows you to parse HTML from a stream, such as a file or network response. The example code creates a readable stream from a file and pipes it to the htmlparser2 stream, which logs tag openings, text content, and tag closings.

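const htmlparser2 = require('htmlparser2');
const fs = require('fs');
// Note: depending on the installed htmlparser2 version, WritableStream may
// have to be imported from a subpath of the package instead of the main module.
const parser = new htmlparser2.WritableStream({
  // Called for every opening tag
  onopentag(name) {
    console.log('Opened tag:', name);
  },
  // Called for every run of text between tags
  ontext(text) {
    console.log('Text:', text);
  },
  // Called for every closing tag
  onclosetag(name) {
    console.log('Closed tag:', name);
  }
});
// Pipe the file into the parser chunk by chunk
fs.createReadStream('example.html').pipe(parser);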

DOM Tree Manipulation

This feature allows you to manipulate the DOM tree after parsing. The example code parses an HTML string into a DOM tree, changes the class attribute of the first element, and then serializes the modified element back to an HTML string.

const htmlparser2 = require('htmlparser2');
const dom = htmlparser2.parseDocument('<div class="test">Hello World</div>');
const divElement = dom.children[0];
divElement.attribs.class = 'new-class';
const serialized = htmlparser2.DomUtils.getOuterHTML(divElement);
console.log(serialized);

Readme

#NodeHtmlParser

A forgiving HTML/XML/RSS parser written in JS for both the browser and NodeJS (yes, despite the name it works just fine in any modern browser). The parser can handle streams (chunked data) and supports custom handlers for writing custom DOMs/output.

##Installing

npm install htmlparser

##Running Tests

###Run tests under node:

node runtests.js

###Run tests in browser:

View runtests.html in any browser

##Usage In Node

var htmlparser = require("htmlparser");
var rawHtml = "Xyz

##Usage In Browser

var handler = new Tautologistics.NodeHtmlParser.DefaultHandler(function (error, dom) {
	if (error) {
		// ...do something for errors...
	} else {
		// ...parsing done, do something...
	}
});
var parser = new Tautologistics.NodeHtmlParser.Parser(handler);
parser.parseComplete(document.body.innerHTML);
alert(JSON.stringify(handler.dom, null, 2));

##Example output

[ { raw: 'Xyz ', data: 'Xyz ', type: 'text' }
, { raw: 'script language= javascript'
  , data: 'script language= javascript'
  , type: 'script'
  , name: 'script'
  , attribs: { language: 'javascript' }
  , children:
     [ { raw: 'var foo = '';<'
       , data: 'var foo = '';<'
       , type: 'text'
       }
     ]
  }
, { raw: '<!-- Waah! -- '
  , data: '<!-- Waah! -- '
  , type: 'comment'
  }
]
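The DOM produced by the handler is a plain array of nodes, with nested nodes under `children`, so it can be walked with ordinary recursion. The following is a minimal sketch under that assumption; `findByName` and `sampleDom` are hypothetical names invented for illustration and are not part of htmlparser.

```javascript
// Illustrative helper (not part of htmlparser): collect every node with the
// given tag name from a DOM array shaped like the example output.
function findByName(nodes, name) {
  const matches = [];
  for (const node of nodes) {
    if (node.name === name) {
      matches.push(node);
    }
    // Descend into nested nodes, if any.
    if (Array.isArray(node.children)) {
      matches.push(...findByName(node.children, name));
    }
  }
  return matches;
}

// Hand-written sample loosely following the shape of handler.dom.
const sampleDom = [
  { raw: 'Xyz ', data: 'Xyz ', type: 'text' },
  {
    raw: 'div',
    data: 'div',
    type: 'tag',
    name: 'div',
    children: [
      {
        raw: 'script',
        data: 'script',
        type: 'script',
        name: 'script',
        children: [{ raw: 'var x = 1;', data: 'var x = 1;', type: 'text' }]
      }
    ]
  }
];

const scripts = findByName(sampleDom, 'script');
console.log(scripts.length); // number of script nodes found at any depth
```

Returning a flat array keeps the helper easy to combine with `filter` or `map`; a real project might prefer the selector libraries listed under Related Projects.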

##Streaming To Parser

while (...) {
	...
	parser.parseChunk(chunk);
}
parser.done();
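As a rough illustration of the loop above, the sketch below splits a string into fixed-size chunks and hands each one to an object exposing `parseChunk` and `done`. The helper `feedInChunks` and the stand-in `recordingParser` are assumptions made so the snippet runs without htmlparser installed; a real `htmlparser.Parser` instance would take the stand-in's place.

```javascript
// Hypothetical helper: feed text to a parser-like object in fixed-size chunks,
// then signal the end of input, mirroring the parseChunk()/done() loop.
function feedInChunks(parser, text, chunkSize) {
  if (!(chunkSize > 0)) {
    throw new Error('chunkSize must be a positive number');
  }
  for (let offset = 0; offset < text.length; offset += chunkSize) {
    parser.parseChunk(text.slice(offset, offset + chunkSize));
  }
  parser.done();
}

// Stand-in for a real parser: it only records what it receives.
const recordingParser = {
  chunks: [],
  finished: false,
  parseChunk(chunk) {
    this.chunks.push(chunk);
  },
  done() {
    this.finished = true;
  }
};

feedInChunks(recordingParser, '<p>Hello</p>', 5);
console.log(recordingParser.chunks); // [ '<p>He', 'llo</', 'p>' ]
```

Note that chunk boundaries can fall in the middle of a tag, which is exactly the situation a chunk-aware parser has to tolerate.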

##Parsing RSS/Atom Feeds

new htmlparser.RssHandler(function (error, dom) {
	...
});

##DefaultHandler Options

###Usage

var handler = new htmlparser.DefaultHandler(
	function (error) { ... }
	, { verbose: false, ignoreWhitespace: true }
);

###Option: ignoreWhitespace

Indicates whether the DOM should exclude text nodes that consist solely of whitespace. The default value is "false".

####Example: true

The following HTML:

<font>
	<br>this is the text
<font>

becomes:

[ { raw: 'font'
  , data: 'font'
  , type: 'tag'
  , name: 'font'
  , children:
     [ { raw: 'br', data: 'br', type: 'tag', name: 'br' }
     , { raw: 'this is the text\n'
       , data: 'this is the text\n'
       , type: 'text'
       }
     , { raw: 'font', data: 'font', type: 'tag', name: 'font' }
     ]
  }
]

####Example: false

The following HTML:

<font>
	<br>this is the text
<font>

becomes:

[ { raw: 'font'
  , data: 'font'
  , type: 'tag'
  , name: 'font'
  , children:
     [ { raw: '\n\t', data: '\n\t', type: 'text' }
     , { raw: 'br', data: 'br', type: 'tag', name: 'br' }
     , { raw: 'this is the text\n'
       , data: 'this is the text\n'
       , type: 'text'
       }
     , { raw: 'font', data: 'font', type: 'tag', name: 'font' }
     ]
  }
]
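The difference between the two outputs above can be reproduced after the fact with a small filter. This is only a sketch of the option's visible effect, assuming the node shape shown in the examples; `dropWhitespaceText` is a hypothetical helper, not the handler's actual implementation.

```javascript
// Illustrative only: remove text nodes made up solely of whitespace from a
// DOM array, recursing into children. Returns new arrays and objects.
function dropWhitespaceText(nodes) {
  return nodes
    .filter((node) => !(node.type === 'text' && /^\s*$/.test(node.data)))
    .map((node) =>
      Array.isArray(node.children)
        ? { ...node, children: dropWhitespaceText(node.children) }
        : node
    );
}

// The DOM from the "false" example above.
const withWhitespace = [
  {
    raw: 'font',
    data: 'font',
    type: 'tag',
    name: 'font',
    children: [
      { raw: '\n\t', data: '\n\t', type: 'text' },
      { raw: 'br', data: 'br', type: 'tag', name: 'br' },
      { raw: 'this is the text\n', data: 'this is the text\n', type: 'text' },
      { raw: 'font', data: 'font', type: 'tag', name: 'font' }
    ]
  }
];

const cleaned = dropWhitespaceText(withWhitespace);
console.log(cleaned[0].children.length); // 3: the '\n\t' text node is gone
```

Text that merely ends in whitespace, such as 'this is the text\n', is kept, because the option only targets nodes that contain nothing else.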

###Option: verbose

Indicates whether to include extra information on each node in the DOM. This information consists of the "raw" attribute (original, unparsed text found between "<" and ">") and the "data" attribute on "tag", "script", and "comment" nodes. The default value is "true".

####Example: true

The following HTML:

<a href="test.html">xxx</a>

becomes:

[ { raw: 'a href="test.html"'
  , data: 'a href="test.html"'
  , type: 'tag'
  , name: 'a'
  , attribs: { href: 'test.html' }
  , children: [ { raw: 'xxx', data: 'xxx', type: 'text' } ]
  }
]

####Example: false

The following HTML:

<a href="test.html">xxx</a>

becomes:

[ { type: 'tag'
  , name: 'a'
  , attribs: { href: 'test.html' }
  , children: [ { data: 'xxx', type: 'text' } ]
  }
]
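To make the rule concrete, the sketch below applies the description of `verbose` to an existing DOM: it drops "raw" from every node and "data" from "tag", "script", and "comment" nodes. `stripVerbose` is a hypothetical helper written from that description alone; it is not how the handler itself is implemented.

```javascript
// Node types whose "data" attribute counts as extra (verbose) information,
// per the option description.
const TYPES_WITHOUT_DATA = new Set(['tag', 'script', 'comment']);

// Illustrative only: return a copy of the DOM without the verbose fields.
function stripVerbose(nodes) {
  return nodes.map((node) => {
    const { raw, ...rest } = node; // "raw" is dropped from every node
    if (TYPES_WITHOUT_DATA.has(node.type)) {
      delete rest.data;
    }
    if (Array.isArray(rest.children)) {
      rest.children = stripVerbose(rest.children);
    }
    return rest;
  });
}

// The DOM from the "true" example above.
const verboseDom = [
  {
    raw: 'a href="test.html"',
    data: 'a href="test.html"',
    type: 'tag',
    name: 'a',
    attribs: { href: 'test.html' },
    children: [{ raw: 'xxx', data: 'xxx', type: 'text' }]
  }
];

const compactDom = stripVerbose(verboseDom);
console.log(JSON.stringify(compactDom));
```

Text nodes keep their "data" attribute, since for them it carries the content rather than extra information.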

###Option: enforceEmptyTags

Indicates whether the DOM should prevent children on tags marked as empty in the HTML spec. Typically this should be set to "true" for HTML parsing and "false" for XML parsing. The default value is "true".

####Example: true

The following HTML:

<link>text</link>

becomes:

[ { raw: 'link', data: 'link', type: 'tag', name: 'link' }
, { raw: 'text', data: 'text', type: 'text' }
]

####Example: false

The following HTML:

<link>text</link>

becomes:

[ { raw: 'link'
  , data: 'link'
  , type: 'tag'
  , name: 'link'
  , children: [ { raw: 'text', data: 'text', type: 'text' } ]
  }
]
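One way to picture the option is as a transformation that moves the children of an empty tag up to become its following siblings. The sketch below does this on an already built DOM. Both `hoistEmptyTagChildren` and the `EMPTY_TAGS` set are assumptions for illustration: the set is a small sample of void elements, not the list the handler actually uses, and the real handler enforces the rule while parsing rather than afterwards.

```javascript
// Assumed sample of tags that may not have children; the handler's own list
// may differ.
const EMPTY_TAGS = new Set(['br', 'hr', 'img', 'input', 'link', 'meta']);

// Illustrative only: turn the children of empty tags into following siblings.
function hoistEmptyTagChildren(nodes) {
  const result = [];
  for (const node of nodes) {
    const hasChildren = Array.isArray(node.children);
    if (node.type === 'tag' && EMPTY_TAGS.has(node.name) && hasChildren) {
      const { children, ...withoutChildren } = node;
      result.push(withoutChildren, ...hoistEmptyTagChildren(children));
    } else if (hasChildren) {
      result.push({ ...node, children: hoistEmptyTagChildren(node.children) });
    } else {
      result.push(node);
    }
  }
  return result;
}

// The DOM from the "false" example above.
const nestedDom = [
  {
    raw: 'link',
    data: 'link',
    type: 'tag',
    name: 'link',
    children: [{ raw: 'text', data: 'text', type: 'text' }]
  }
];

const flatDom = hoistEmptyTagChildren(nestedDom);
console.log(flatDom.length); // 2: the link tag followed by its former child
```

For XML input, where a tag such as link may legitimately contain content, no such transformation is wanted, which matches the advice to set the option to "false" there.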

##DomUtils

###TBD (see utils_example.js for now)

##Related Projects

Looking for CSS selectors to search the DOM? Try Node-SoupSelect, a port of SoupSelect to NodeJS: http://github.com/harryf/node-soupselect

There's also a port of hpricot to NodeJS that uses HtmlParser for HTML parsing: http://github.com/silentrob/Apricot


Last updated on 28 Aug 2011
