@riteable/scraper

A basic website scraper.

  • 1.0.2
  • latest
  • npm

Scraper

Usage

A simple example:

const Scraper = require('@riteable/scraper')

async function run () {
  const scraper = new Scraper()

  scraper
    .setIndexUrl('https://example.com')
    .setLinkSelector('.article .title a')

  return scraper.fetchPages()
}

run()
  .then(console.log)
  .catch(console.error)

The above example would output something like the following:

[
  {
    title: 'Some article',
    description: 'A description of the article.',
    image: 'https://example.com/path/to/an/image.jpg',
    url: 'https://example.com/some-article'
  }
]

By default, a Scraper instance tries to extract these four fields: title, description, image, and url.

If you need to extract more data, or don't need the above, you can use the setDataMap() method to specify what you need:

scraper.setDataMap({
  ...scraper.helpers,
  publishedAt: ({ $ }) => $('meta[property="article:published_time"]').attr('content')
})

The helpers have built-in fallbacks for locating data. See helpers.js for the implementations.
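A custom field can use the same fallback idea. The sketch below is only an illustration: `firstNonEmpty` and the `author` field are hypothetical names, the selectors are assumptions about the target site, and the real helpers.js may work differently.

```javascript
// Hypothetical helper (not part of @riteable/scraper):
// returns the first non-empty string produced by the given getters.
function firstNonEmpty (...getters) {
  for (const get of getters) {
    const value = get()
    if (typeof value === 'string' && value.trim() !== '') {
      return value.trim()
    }
  }
  return undefined
}

// A callback in the shape setDataMap() expects: it receives an object with `$`.
// The selectors are assumptions; adjust them to the pages you scrape.
const author = ({ $ }) => firstNonEmpty(
  () => $('meta[name="author"]').attr('content'),
  () => $('meta[property="article:author"]').attr('content'),
  () => $('[rel="author"]').first().text()
)
```

It could then be passed alongside the defaults, for example `scraper.setDataMap({ ...scraper.helpers, author })`.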

API

The following properties and methods are available:

helpers: This property contains helper functions to extract commonly needed data. Currently implemented:

  • title()
  • description()
  • image()
  • url()

setIndexUrl(url): Set the URL of the page that lists the articles/pages you want to scrape.

setLinkSelector(selector): Set the selector of the <a> elements which link to the pages to be scraped. This module uses cheerio for parsing and traversing documents.

setDataMap(object): Define how data is extracted and mapped to fields. Each key is a field name, and each value must be a callback function. Each callback receives a single object parameter. Its $ property is the page's document parsed by cheerio, so you can easily query data within the document. The remaining properties come from the needle response for the requested page.
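As a rough sketch of callbacks that read from both parts of that parameter: the example below assumes the needle response is spread next to `$`, so that its usual properties (`statusCode`, `headers`) can be destructured directly. That layout, the `responseFields` name, and the field names are assumptions for illustration, so check the parameter's actual shape before relying on it.

```javascript
// Field callbacks in the shape setDataMap() expects.
// Assumption: response properties such as statusCode and headers sit
// directly on the callback parameter, next to `$`.
const responseFields = {
  // HTTP status code of the scraped page
  status: ({ statusCode }) => statusCode,

  // Content-Type header, or undefined when no headers are present
  contentType: ({ headers = {} }) => headers['content-type'],

  // Text of the first <h1>, queried through cheerio
  heading: ({ $ }) => $('h1').first().text().trim()
}
```

These could be merged with the defaults: `scraper.setDataMap({ ...scraper.helpers, ...responseFields })`.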

setThrottle(object): Throttle requests with the delay and concurrent settings. For example:

scraper.setThrottle({
  delay: 500, // milliseconds between requests
  concurrent: 1 // number of requests at a time
})
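To make the two settings concrete, here is a minimal sketch of one way a delay and a concurrency limit can be combined. It is not the package's implementation: `runThrottled` and `sleep` are hypothetical names, and the scraper may schedule its requests differently.

```javascript
// Hypothetical illustration of "delay" and "concurrent"; not the package's code.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms))

// `tasks` is an array of functions that each return a promise.
async function runThrottled (tasks, { delay = 0, concurrent = 1 } = {}) {
  const results = new Array(tasks.length)
  let next = 0

  // Each worker claims the next unstarted task, runs it,
  // then pauses for `delay` ms before claiming another.
  async function worker () {
    while (next < tasks.length) {
      const index = next++
      results[index] = await tasks[index]()
      if (next < tasks.length) await sleep(delay)
    }
  }

  // At most `concurrent` workers run at the same time.
  const workerCount = Math.min(Math.max(1, concurrent), tasks.length)
  await Promise.all(Array.from({ length: workerCount }, worker))
  return results
}
```

With `concurrent: 1` the tasks run strictly one after another, separated by `delay` milliseconds. Raising `concurrent` lets that many tasks be in flight at once.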

async fetchIndex(): Parse data from the index URL only, as set by setIndexUrl().

async fetchPages(): Extract data from the linked pages, which are found using the selector set by setLinkSelector().
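The methods above can be combined along these lines. This is a sketch only: `configureScraper` is a hypothetical wrapper, and the URL and selectors are placeholders taken from the earlier examples. Only setIndexUrl() and setLinkSelector() are shown chained in this README, so the other setters are called as separate statements here.

```javascript
// Hypothetical wrapper that applies the documented settings to a Scraper instance.
function configureScraper (scraper) {
  // Where the list of links lives, and which <a> elements to follow
  scraper
    .setIndexUrl('https://example.com')
    .setLinkSelector('.article .title a')

  // Keep the default fields and add one custom field
  scraper.setDataMap({
    ...scraper.helpers,
    publishedAt: ({ $ }) => $('meta[property="article:published_time"]').attr('content')
  })

  // One request at a time, 500 ms apart
  scraper.setThrottle({ delay: 500, concurrent: 1 })

  return scraper
}
```

With the package installed, this might be used as `configureScraper(new Scraper()).fetchPages().then(console.log).catch(console.error)`. Call `fetchIndex()` instead to parse only the index page.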

Package last updated on 12 Jun 2024
