web-auto-extractor


Automatically extracts structured information from webpages



Web Auto Extractor


Automatically extracts semantically structured information from any HTML webpage.

Supported formats:

  • Formats that support Schema.org vocabularies:
    • Microdata
    • RDFa-Lite
    • JSON-LD
  • Miscellaneous meta tags

Demo it on tonicdev

Introduction

Parse any semantically structured HTML and query the result.

import WAE from 'web-auto-extractor'
import request from 'request'

const pageUrl = 'http://southernafricatravel.com/'

request(pageUrl, function (error, response, body) {
  // Stop early if the request failed
  if (error) return console.error(error)

  // Load the HTML to get a WAEObject
  let wae = WAE.init(body)
  // console.log(wae.parse())

  // If the page uses microdata
  let waeMicrodata = wae.parseMicrodata()
  // See API for more options
  // console.log(waeMicrodata.data())

  // You can query the parsed result for properties marked up by the page
  let telephones = waeMicrodata.find('telephone')
  // console.log(telephones)
})

CommonJS import style

var WAE = require('web-auto-extractor').default
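
The `.default` suffix is most likely needed because the package is published as a transpiled ES module, whose default export ends up on a `default` property when loaded through `require`. That is an assumption about the build, not something this README states. The helper below, `interopDefault` (a hypothetical name, not part of the package), is a minimal sketch that returns the default export for either module shape:

```javascript
// Hypothetical helper: unwrap a transpiled ES module's default export.
// `__esModule` is the marker that transpilers such as Babel set on such modules.
function interopDefault (mod) {
  return mod && mod.__esModule && 'default' in mod ? mod.default : mod
}

// Stand-ins for the two module shapes (not the real package):
const transpiled = { __esModule: true, default: { init () { return 'wae' } } }
const plain = { init () { return 'wae' } }

console.log(interopDefault(transpiled).init()) // prints "wae"
console.log(interopDefault(plain).init()) // prints "wae"
```

With such a helper, the import could be written as `var WAE = interopDefault(require('web-auto-extractor'))`.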

Installation

npm install web-auto-extractor

API

Initializing

First, load the HTML to get a WAEObject:

const wae = WAE.init('<div itemtype="Product">...</div>')

Each WAEObject comes with the following set of methods.

WAEObject Methods

NOTE: The results of these functions are cached, so repeated calls to them shouldn't affect performance.
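
To illustrate what caching means here, the sketch below wraps an expensive function so that it runs only once and later calls return the stored result. `cacheResult` and `parseOnce` are hypothetical names; how the library caches internally is not described in this README.

```javascript
// Hypothetical sketch of result caching; not the library's actual code.
function cacheResult (compute) {
  let cached = false
  let value
  return function () {
    if (!cached) {
      value = compute()
      cached = true
    }
    return value
  }
}

let runs = 0
const parseOnce = cacheResult(function () {
  runs += 1 // stands in for an expensive parse of the HTML
  return { Product: [] }
})

const first = parseOnce()
const second = parseOnce()
console.log(runs) // prints 1
console.log(first === second) // prints true
```

Because the second call returns the same object rather than a fresh copy, mutating a returned result would also change what later calls see, assuming the library behaves like this sketch.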

.parse()

Finds all supported semantically structured information in the HTML and returns it in normalized format.

.parseMicrodata()

Finds all Microdata information on the page and returns it as a WAEParserObject.

.parseRdfa()

Finds all RDFa-Lite information on the page and returns it as a WAEParserObject.

.parseJsonld()

Finds all JSON-LD information on the page and returns it as a WAEParserObject.

.parseMetaTags()

Finds all meta tag information on the page and returns it as a WAEParserObject.
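
The format-specific parsers above can be combined when it is not known in advance which markup a page uses. The helper below is a small sketch that calls each documented parser and collects its `.data()` result. `collectByFormat` and the keys of the returned object are names chosen for this example, not part of the package.

```javascript
// Runs every format-specific parser on one WAEObject and returns the
// normalized data of each, keyed by a label chosen for this example.
function collectByFormat (wae) {
  return {
    microdata: wae.parseMicrodata().data(),
    rdfa: wae.parseRdfa().data(),
    jsonld: wae.parseJsonld().data(),
    metatags: wae.parseMetaTags().data()
  }
}
```

Usage would look like `const results = collectByFormat(WAE.init(body))`, where `body` is the HTML string of the page.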

WAEParserObject Methods

NOTE: The results of these functions are cached, so repeated calls to them shouldn't affect performance.

.data()

Gets the normalized result of the parsed format.

.unnormalizedData()

Gets the unnormalized flattened result of the parsed format which includes meta information relating to the parsed properties.

.find(propName)

Returns a list of elements from .data() that correspond to the property named [propName].
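
As a rough picture of what such a lookup involves, the sketch below walks a nested result and collects every value stored under a given property name. `findProperty` is a hypothetical function, and the shape of `sample` (items grouped by type) and its values are made up for illustration; the actual output of .data() and the internals of .find() may differ.

```javascript
// Hypothetical sketch: collect every value stored under `propName`
// anywhere in a nested result. Not the library's implementation.
function findProperty (node, propName, found = []) {
  if (Array.isArray(node)) {
    node.forEach(function (item) { findProperty(item, propName, found) })
  } else if (node !== null && typeof node === 'object') {
    Object.keys(node).forEach(function (key) {
      if (key === propName) found.push(node[key])
      findProperty(node[key], propName, found)
    })
  }
  return found
}

// Made-up data in an assumed shape (items grouped by type):
const sample = {
  TravelAgency: [
    {
      name: 'Example Travel',
      telephone: '+00 000 0000',
      branch: { telephone: '+00 111 1111' }
    }
  ]
}

console.log(findProperty(sample, 'telephone'))
// prints [ '+00 000 0000', '+00 111 1111' ]
```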

See test cases for more examples.


Last updated on 10 Jun 2016
