
This project is a fork of indix/web-auto-extractor.
Parse semantically structured information from any HTML webpage.
Supported formats:
- Encodings that support Schema.org vocabularies:
  - Microdata
  - RDFa-lite
  - JSON-LD
- Meta tags
- Heading tags
Many websites mark up their pages with Schema.org vocabularies for better SEO. This library parses that information into JSON.
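To make the JSON-LD case concrete, here is a rough sketch of what such markup looks like inside a page and how it could be pulled out by hand. It is an illustration only, not how this library works internally: `extractJsonLd` and `exampleHTML` are hypothetical names, and the regex approach is a simplification that a real HTML parser would handle more robustly.

```javascript
// Illustrative only: a hand-rolled JSON-LD extractor, NOT this library's implementation.
// extractJsonLd is a hypothetical helper name used for this sketch.
function extractJsonLd(html) {
  // Match every <script type="application/ld+json"> ... </script> block.
  const pattern =
    /<script[^>]*type=["']application\/ld\+json["'][^>]*>([\s\S]*?)<\/script>/gi;
  const results = [];
  for (const match of html.matchAll(pattern)) {
    try {
      results.push(JSON.parse(match[1]));
    } catch {
      // Skip blocks whose content is not valid JSON.
    }
  }
  return results;
}

// A made-up page carrying one Schema.org Product description as JSON-LD.
const exampleHTML = `
<html>
  <head>
    <script type="application/ld+json">
      { "@context": "https://schema.org", "@type": "Product", "name": "Example Widget" }
    </script>
  </head>
  <body><h1>Example Widget</h1></body>
</html>`;

const items = extractJsonLd(exampleHTML);
console.log(items[0]['@type'], items[0].name); // prints "Product Example Widget"
```

Microdata and RDFa-lite carry the same kind of vocabulary in element attributes rather than in a script block, so they need real DOM traversal instead of a pattern match like this one.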
Installation
npm i --save @marbec/web-auto-extractor
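import WebAutoExtractor from '@marbec/web-auto-extractor';

// Any HTML string works as input; this one is only a placeholder.
const sampleHTML = '<html><head><title>Example</title></head><body><h1>Example</h1></body></html>';

const parsed = new WebAutoExtractor({
  addLocation: false,
  embedSource: false,
  skipEmptyHeadings: false,
  skipLayoutElements: false,
}).parse(sampleHTML);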
Browser
You can run the parser directly in the browser on any website using the following commands:
const { default: WebAutoExtractor } = await import(
  'https://unpkg.com/@marbec/web-auto-extractor@latest/dist/index.js'
);
new WebAutoExtractor().parse(document.documentElement.outerHTML);
Examples
See the test cases for sample inputs and outputs.