@marbec/web-auto-extractor

Automatically extracts structured information from webpages

Version: 2.2.0 (latest)
Source: npm
Weekly downloads: 141 (30.56%)
Maintainers: 1

Web Auto Extractor 2.0

This project is a fork of indix/web-auto-extractor.

Parse semantically structured information from any HTML webpage.

Supported formats:

  • Encodings that support Schema.org vocabularies:
    • Microdata
    • RDFa-lite
    • JSON-LD
  • Meta tags
  • Heading tags

Many websites mark up their webpages with Schema.org vocabularies for better SEO. This library parses that information into JSON.
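To make the list of supported formats concrete, here is a small made-up page that carries one instance of each: meta tags, JSON-LD, a heading, Microdata, and RDFa Lite. The page content and the variable name `examplePage` are illustrative assumptions, not taken from the project's test suite.

```javascript
// Hypothetical sample page; all product data is invented for illustration.
const examplePage = `<!DOCTYPE html>
<html>
  <head>
    <title>Example product</title>
    <!-- Meta tags -->
    <meta name="description" content="An example product page">
    <meta property="og:title" content="Example product">
    <!-- JSON-LD -->
    <script type="application/ld+json">
      { "@context": "https://schema.org", "@type": "Product", "name": "Example product" }
    </script>
  </head>
  <body>
    <!-- Heading tag -->
    <h1>Example product</h1>
    <!-- Microdata -->
    <div itemscope itemtype="https://schema.org/Offer">
      <span itemprop="price">9.99</span>
      <meta itemprop="priceCurrency" content="USD">
    </div>
    <!-- RDFa Lite -->
    <div vocab="https://schema.org/" typeof="Organization">
      <span property="name">Example Inc.</span>
    </div>
  </body>
</html>`;
```

A string like this can be passed to `parse()` in place of real page HTML when trying the library out.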

Installation

npm i --save @marbec/web-auto-extractor

Usage

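import WebAutoExtractor from '@marbec/web-auto-extractor';

// Any HTML string works as input; this short page is only a placeholder.
const sampleHTML =
  '<html><head><title>Example</title></head><body><h1>Example</h1></body></html>';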

const parsed = new WebAutoExtractor({
  // Add location information to the root elements in the parsed data.
  // Location is stored as start,end offset values in the @location property.
  addLocation: false,

  // Embed the source HTML in the root elements in the parsed data using the @source property.
  // This property is either a boolean to embed sources for all data types or an array of data types to embed sources for.
  embedSource: false,

  // Skip headings with empty or whitespace-only text content.
  // When true, headings like <h1></h1> or <h2>   </h2> will be excluded from results.
  skipEmptyHeadings: false,

  // Skip headings that are inside layout elements (header, footer, nav, aside).
  // When true, headings within these semantic layout containers will be excluded from results.
  // The isLayoutElement field is only included when this option is false.
  skipLayoutElements: false,
}).parse(sampleHTML);

// Output format
/* {
    "metatags": {},
    "microdata": {},
    "rdfa": {},
    "jsonld": {},
    "headings": {}
} */
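
Because the result is a plain object with one key per format, a quick way to see what a page contains is to count the entries under each key. The sketch below relies only on the five top-level keys shown in the output format; `summarizeParsed` is a hypothetical helper, and the inner shape of `fakeParsed` is an assumption standing in for a real `parse()` result.

```javascript
// Top-level keys taken from the "Output format" comment in the usage example.
const SECTIONS = ['metatags', 'microdata', 'rdfa', 'jsonld', 'headings'];

// Hypothetical helper: report how many entries each section contains.
function summarizeParsed(parsed) {
  const summary = {};
  for (const section of SECTIONS) {
    const value = parsed[section];
    // Treat a missing section or a non-object value as empty.
    summary[section] =
      value && typeof value === 'object' ? Object.keys(value).length : 0;
  }
  return summary;
}

// Stand-in for a real parse() result; the nested values are assumptions.
const fakeParsed = {
  metatags: { description: ['An example page'] },
  microdata: {},
  rdfa: {},
  jsonld: { Product: [{ '@type': 'Product', name: 'Example product' }] },
  headings: {},
};

console.log(summarizeParsed(fakeParsed));
// → { metatags: 1, microdata: 0, rdfa: 0, jsonld: 1, headings: 0 }
```

With the library installed, the same helper could be applied to the `parsed` value from the usage example.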

Browser

You can run the parser directly in the browser on any website using the following commands:

const { default: WebAutoExtractor } = await import(
  'https://unpkg.com/@marbec/web-auto-extractor@latest/dist/index.js'
);
new WebAutoExtractor().parse(document.documentElement.outerHTML);
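
The `@latest` tag can change between sessions, so for repeatable results it may help to pin a version in the URL. The following is a minimal sketch under that assumption: `extractorUrl` and `extractFromHtml` are hypothetical helper names, and the sketch assumes that other published versions expose the same `dist/index.js` entry point as the one shown above.

```javascript
// Build the unpkg URL for a given version tag (defaults to "latest").
function extractorUrl(version = 'latest') {
  return `https://unpkg.com/@marbec/web-auto-extractor@${version}/dist/index.js`;
}

// Load the parser from unpkg and run it on an HTML string.
// Intended for environments that allow importing modules by URL, such as a browser.
async function extractFromHtml(html, version = 'latest') {
  const { default: WebAutoExtractor } = await import(extractorUrl(version));
  return new WebAutoExtractor().parse(html);
}
```

In a browser console this could be called as `await extractFromHtml(document.documentElement.outerHTML, '2.2.0')`; some sites may block the import through their Content Security Policy.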

Examples

See the test cases for sample inputs and outputs.

Keywords

crawler

FAQs

Package last updated on 31 Jul 2025
