HTML Parser
This is a simple HTML parser for Node.js, built on the htmlparser2 package. It provides an HtmlParser class that can be used to query and manipulate HTML documents. The class also provides helper methods such as getAttribute, getText, getParent, getChildren, getOuterHTML, getInnerHTML, getNext, getPrev, querySelector, querySelectorAll, and getListData.
Installation
You can install this package using npm:
npm install @hd/html-parser
Usage
Here's an example of how to use this package:
import HDHtmlParser from '@hd/html-parser';

// The markup is passed as a string (a template literal here).
const html = `<html> <head> <title>My title</title> </head> <body> <h1>Heading 1</h1> <p>Paragraph 1</p> <p>Paragraph 2</p> </body> </html>`;

// Top-level await requires an ES module.
const document = await HDHtmlParser(html);
const title = document.querySelector('title').getText();
const paragraphs = document.querySelectorAll('p').map(p => p.getText());
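Because HDHtmlParser resolves to HtmlParser | null and querySelector can return null, the chained calls in the example throw if parsing fails or nothing matches. A minimal TypeScript sketch of a null-safe lookup follows. The Queryable interface and the textOf helper are hypothetical names introduced for illustration. They only mirror the method signatures listed in the API section and are not part of the package.

```typescript
// Hypothetical structural type mirroring the documented method signatures.
interface Queryable {
  querySelector(selector: string): Queryable | null;
  getText(): string | null;
}

// Returns the text of the first match, or the fallback when the document,
// the element, or its text is missing.
function textOf(root: Queryable | null, selector: string, fallback = ''): string {
  const text = root?.querySelector(selector)?.getText();
  return text ?? fallback;
}
```

With the parsed document from the example, textOf(document, 'title') would return the title text, while textOf(document, 'h2', 'n/a') would fall back to 'n/a' because the markup has no h2 element.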
API
HtmlParser
The HtmlParser class provides the following methods:
querySelector(selector: string): HtmlParser | null
Returns a new HtmlParser object wrapping the first element in the document that matches the specified selector, or null if no such element exists.
querySelectorAll(selector: string): HtmlParser[]
Returns an array of HtmlParser objects, one for each element in the document that matches the specified selector. If no elements match, an empty array is returned.
getHtml(): string | null
Returns the HTML string of the document, or null if an error occurs.
getAttribute(name: string): string | null | undefined
Returns the value of the specified attribute on the current element, or null if the attribute does not exist or an error occurs.
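Since getAttribute may return null or undefined, callers that collect attribute values usually need to filter them. The sketch below shows one way to do that in TypeScript. AttrNode and attributeValues are hypothetical names used only for illustration, modeled on the signatures documented here.

```typescript
// Hypothetical structural type mirroring the documented method signatures.
interface AttrNode {
  querySelectorAll(selector: string): AttrNode[];
  getAttribute(name: string): string | null | undefined;
}

// Collects the named attribute from every match, skipping elements
// where the attribute is missing (null or undefined).
function attributeValues(root: AttrNode, selector: string, name: string): string[] {
  return root
    .querySelectorAll(selector)
    .map((el) => el.getAttribute(name))
    .filter((value): value is string => typeof value === 'string');
}
```

For instance, attributeValues(document, 'a', 'href') would gather the href of every link that has one.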
getText(): string | null
Returns the text content of the current element, or null if an error occurs.
getParent(): HtmlParser
Returns a new HtmlParser object that represents the parent of the current element.
getChildren(): HtmlParser[]
Returns an array of HtmlParser objects that represent the children of the current element.
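getChildren can be applied recursively to walk a subtree. As a small illustration, the TypeScript sketch below counts all descendants of a node. TreeNode and countDescendants are hypothetical names, and the sketch assumes getChildren always returns an array, as its signature suggests.

```typescript
// Hypothetical structural type mirroring the documented getChildren signature.
interface TreeNode {
  getChildren(): TreeNode[];
}

// Counts every descendant of the node (children, grandchildren, and so on).
function countDescendants(node: TreeNode): number {
  return node
    .getChildren()
    .reduce((total, child) => total + 1 + countDescendants(child), 0);
}
```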
getOuterHTML(): string | null
Returns the outer HTML of the current element, or null if an error occurs.
getInnerHTML(): string | null
Returns the inner HTML of the current element, or an empty string if the element has no children.
getNext(): HtmlParser | null
Returns a new HtmlParser object that represents the next sibling of the current element, or null if no such element exists.
getPrev(): HtmlParser | null
Returns a new HtmlParser object that represents the previous sibling of the current element, or null if no such element exists.
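Because getNext returns null once there are no more siblings, it can drive a simple loop. The TypeScript sketch below gathers all siblings that follow a node. SiblingNode and followingSiblings are hypothetical names for illustration. The same pattern would apply to getPrev in the other direction.

```typescript
// Hypothetical structural type mirroring the documented getNext signature.
interface SiblingNode {
  getNext(): SiblingNode | null;
}

// Walks forward with getNext until it returns null, collecting each sibling.
function followingSiblings(start: SiblingNode): SiblingNode[] {
  const siblings: SiblingNode[] = [];
  for (let node = start.getNext(); node !== null; node = node.getNext()) {
    siblings.push(node);
  }
  return siblings;
}
```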
getListData(selector: string, itemsSelector: any, meta: any): object | null
Returns an object that represents the data extracted from the list of elements that match the specified selector. The itemsSelector argument is an object that maps each key of the output object to a function that extracts the data from the corresponding elements. The meta argument is an object that can be used to pass additional information to the extractor functions. Returns null if an error occurs.
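To make the shape of itemsSelector and meta more concrete, here is a hedged TypeScript sketch. The extractor signature (item, meta), the 'ul.links li' selector, the baseUrl field, and all type and function names are assumptions made for illustration. The description above does not specify what arguments extractors receive or how the result object is structured, so check the package source before relying on this.

```typescript
// Hypothetical structural types; only the method signatures come from this README.
interface ListItem {
  querySelector(selector: string): ListItem | null;
  getText(): string | null;
  getAttribute(name: string): string | null | undefined;
}
interface ListSource {
  getListData(selector: string, itemsSelector: any, meta: any): object | null;
}

// ASSUMPTION: each extractor is called with a matched element and the meta object.
type Extractor = (item: ListItem, meta: { baseUrl: string }) => string | null;

// Keys of this object become keys of the extracted data.
const linkExtractors: { label: Extractor; url: Extractor } = {
  label: (item) => item.getText(),
  url: (item, meta) => {
    const href = item.querySelector('a')?.getAttribute('href');
    return href ? meta.baseUrl + href : null;
  },
};

// Passes the extractors and meta through to getListData.
function readLinks(doc: ListSource, baseUrl: string): object | null {
  return doc.getListData('ul.links li', linkExtractors, { baseUrl });
}
```

Since getListData returns null on error, callers of readLinks should check the result before using it.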
HDHtmlParser(html: string): Promise<HtmlParser | null>
This is the main function of the package. It takes an HTML string
as input and returns a Promise
that resolves to