contract-scraper

A customisable data scraper for the web based on JSON contracts

Version: 6.0.0

Version published: last year

Maintainers: 0

Created: 6 years ago

With contract-scraper you can scrape an HTML page and return the data in a structured format.

Installation

npm install contract-scraper --save

yarn add contract-scraper

Usage

To scrape a page, you can create a new instance of contract-scraper with these parameters:

import Scraper from 'contract-scraper';

const contract = {
  itemSelector: 'li',
  puppeteer: true,
  attributes: {
    name: {
      type: 'text',
      selector: '.name'
    },
    link: {
      type: 'link',
      selector: 'a',
      attribute: 'href'
    }
  }
}

const puppeteerOptions = {
  headless: false,
}

const scraper = new Scraper('http://website.com', contract, puppeteerOptions)

A scraper can be initialised with custom Puppeteer launch options.

A contract accepts the following properties:

`itemSelector` (string)

A CSS selector for the element to be scraped. The scraper will process all the elements matching this selector.

`puppeteer` (boolean)

If set to true, contract-scraper will use Puppeteer to load and scrape the page contents.

`waitForPageLoadSelector` (string)

Puppeteer will wait for this CSS selector to exist in the DOM before scraping the page. Must be used in conjunction with puppeteer: true.
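As an illustration of how these two properties combine, here is a minimal contract sketch for a page whose list is rendered client-side. The `.results` selector and the attribute names are placeholders chosen for this example, not values from the library.

```javascript
// Hypothetical contract for a client-side rendered list.
// '.results' and '.name' are placeholder selectors for illustration only.
const contract = {
  itemSelector: '.results li',
  puppeteer: true, // load the page with Puppeteer
  waitForPageLoadSelector: '.results', // scrape only once this element exists
  attributes: {
    name: { type: 'text', selector: '.name' },
  },
};

console.log(contract.waitForPageLoadSelector);
```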

`attributes` (object)

Defines the data to scrape for each item.

Each attribute matches an HTML element to scrape. The attribute type defines how data will be extracted from the element and how it should be formatted in the final output. For example, you can use one of the in-built types to extract a number from an element:

<ul>
  <li>
    <div class="name">Iron man</div>
    <div class="price">100 euros</div>
  </li>
  <li>
    <div class="name">Captain America</div>
    <div class="price">500 euros</div>
  </li>
</ul>

import Scraper from 'contract-scraper';

const contract = {
  itemSelector: 'li',
  attributes: {
    name: {
      type: 'text',
      selector: '.name',
    },
    price: {
      type: 'number',
      selector: '.price',
    },
  },
};

const scraper = new Scraper('http://characters.com', contract);

scraper.scrapePage().then(items => {
  console.log(items);
  // [
  //   {
  //     name: 'Iron man',
  //     price: 100
  //   },
  //   {
  //     name: 'Captain America',
  //     price: 500
  //   }
  // ]
});
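How the `number` type turns `100 euros` into `100` is not described here. One plausible approach, shown as a sketch only, is to pull the first numeric run out of the element text. The `extractNumber` helper below is hypothetical and not part of contract-scraper's API.

```javascript
// Hypothetical helper, not contract-scraper's implementation:
// returns the first integer or decimal found in the text, or null.
function extractNumber(text) {
  const match = String(text).match(/-?\d+(?:\.\d+)?/);
  return match ? Number(match[0]) : null;
}

console.log(extractNumber('100 euros')); // 100
console.log(extractNumber('no digits here')); // null
```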

Each attribute can have the following properties:

name (string) - A label for this attribute for the final output
selector (string) - The CSS selector for the element (scoped to itemSelector).
type (string) - A custom type, or one of the in-built ones that returns:
- background-image: A background-image url from a style string
- link: An absolute URL
- number: A number
- size: A number for size in m².
- text: Inner text of the element
attribute (optional) (string)

The name of the HTML attribute to scrape data from. For example, the element below can be scraped with the attribute definition that follows it:
```
<a href="http://linktoscrape">Homepage</a>
```
```
  {
    name: 'URL',
    type: 'link',
    selector: 'a',
    attribute: 'href'
  }
```
If attribute is not specified, the attribute type uses the innerText of the element by default.
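The `link` type is described as returning an absolute URL. A minimal sketch of that idea, assuming relative `href` values are resolved against the scraped page's URL, can use the standard `URL` constructor. The `toAbsoluteUrl` helper is hypothetical and only illustrates the resolution step.

```javascript
// Hypothetical helper: resolve an href against the URL of the scraped page.
// Uses the standard WHATWG URL class available globally in Node.js and browsers.
function toAbsoluteUrl(href, pageUrl) {
  return new URL(href, pageUrl).href;
}

console.log(toAbsoluteUrl('/heroes/iron-man', 'http://characters.com/list'));
// http://characters.com/heroes/iron-man

// Already-absolute URLs are kept, but normalised (note the trailing slash).
console.log(toAbsoluteUrl('http://linktoscrape', 'http://characters.com'));
// http://linktoscrape/
```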

data (optional) (object) - If you want to scrape HTML data attributes you can do it in two ways:

Directly scraping a data attribute:

<div data-country="Australia"></div>

{
  name: 'Country',
  type: 'text',
  selector: 'data-country',
  data: { name: 'country' }
}

This will return "Australia" in your list of results.

For scraping a JSON value inside a data attribute:

<div data-price='{"currency": "aud"}'></div>

{
  name: 'Price',
  type: 'text',
  selector: 'data-price',
  data: { name: 'price', key: 'currency'}
}

This will return "aud" in your list of results.
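The two modes can be pictured with a small sketch that works on the raw string stored in a data attribute. The `readDataValue` helper is hypothetical. It assumes the attribute holds strictly valid JSON (double-quoted keys and strings) whenever a `key` is given.

```javascript
// Hypothetical helper, not part of contract-scraper.
// rawValue: the string stored in the data attribute.
// key: optional property to read after parsing the string as JSON.
function readDataValue(rawValue, key) {
  if (key === undefined) {
    return rawValue; // plain data attribute, returned as-is
  }
  const parsed = JSON.parse(rawValue); // throws on invalid JSON
  return parsed[key];
}

console.log(readDataValue('Australia')); // Australia
console.log(readDataValue('{"currency": "aud"}', 'currency')); // aud
```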

Nested attributes

It's also possible to scrape nested attributes, like a list inside an item:

<ul class="friends">
  <li>
    <span>Spiderman</span>
    <ul>
      <li><strong>Iron</strong><em>Man</em></li>
      <li><strong>Captain</strong><em>America</em></li>
    </ul>
  </li>
</ul>

The contract:

{
  "itemSelector": ".friends li",
  "attributes": {
    "name": { "type": "text", "selector": "span" },
    "friends": {
      "itemSelector": "ul li",
      "attributes": {
        "firstName": { "type": "text", "selector": "strong" },
        "lastName": { "type": "text", "selector": "em" }
      }
    }
  }
}

This returns each item's friends as an array (nested attributes can use any type):

[
  {
    name: 'Spiderman',
    friends: [
      { firstName: 'Iron', lastName: 'Man' },
      { firstName: 'Captain', lastName: 'America' },
    ],
  },
];

Custom attribute types

In addition to the in-built attribute types, you can provide your own when you create a new instance of the scraper. A custom attribute type needs to be a class or a constructor function whose instances have a value property. Its constructor receives the innerText of the matching element as a string, which you can then format however you like.

For example, to extract a list of tags and format them as an array:

<ul>
  <li>
    <div class="name">Australia</div>
    <div class="tags">spiders,vegemite,scorching,heat</div>
  </li>
</ul>

import Scraper from 'contract-scraper';

const contract = {
  itemSelector: 'li',
  attributes: {
    countryName: {
      type: 'text',
      selector: '.name',
    },
    tags: {
      type: 'list',
      selector: '.tags',
    },
  },
};

// Custom type: the formatted result is exposed on the value property.
function ListFromString(commaSeparatedString) {
  this.value = commaSeparatedString.split(',');
}

const scraper = new Scraper('http://countries.com', contract, {
  list: ListFromString,
});

scraper.scrapePage().then(items => {
  console.log(items);
  // [
  //   {
  //     countryName: 'Australia',
  //     tags: [ 'spiders', 'vegemite', 'scorching', 'heat' ]
  //   }
  // ]
});
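The same custom type can be written as a class. The sketch below assumes, based on the description above, that the scraper instantiates the type with the element text and reads its `value` property. `TagList` and `formatWith` are hypothetical names, and `formatWith` only stands in for whatever the scraper does internally.

```javascript
// Custom type as a class: trims each tag and drops empty entries.
class TagList {
  constructor(text) {
    this.value = text
      .split(',')
      .map(tag => tag.trim())
      .filter(tag => tag.length > 0);
  }
}

// Hypothetical stand-in for the scraper's internal use of a custom type:
// look the type up by name, instantiate it with the text, read `.value`.
function formatWith(types, typeName, text) {
  const Type = types[typeName];
  return new Type(text).value;
}

const tags = formatWith({ list: TagList }, 'list', 'spiders, vegemite,,heat ');
console.log(tags); // [ 'spiders', 'vegemite', 'heat' ]
```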

Parsing JSON inside script tags

Sometimes you may want to extract values from inside a script tag on the page. For the moment, contract-scraper only supports parsing JSON. For example:

<html>
  <head>
    <title>Page with a script tag</title>
  </head>
  <body>
    <script type="application/ld+json" id="info">
      {
        "characters": [
          {
            "name": "Jon Snow",
            "friends": [
              { "firstName": "Sansa", "lastName": "Stark" },
              { "firstName": "Bran", "lastName": "Stark" },
              { "firstName": "Arya", "lastName": "Stark" }
            ],
            "photo": "http://images.com/jonsnow",
            "price": {
              "amount": "12345 dollars"
            }
          },
          {
            "name": "Ned Stark",
            "friends": [
              { "firstName": "Sansa", "lastName": "Stark" },
              { "firstName": "Bobby", "lastName": "B" },
              { "firstName": "Little", "lastName": "finger" }
            ],
            "photo": "http://images.com/nedstark",
            "price": {
              "amount": "6789 euros"
            }
          }
        ]
      }
    </script>
  </body>
</html>

import Scraper from 'contract-scraper';

const contract = {
  scriptTagSelector: '#info',
  itemSelector: 'characters',
  attributes: {
    name: { type: 'text', selector: 'name' },
    friends: {
      itemSelector: 'friends',
      attributes: {
        firstName: { type: 'text', selector: 'firstName' },
        lastName: { type: 'text', selector: 'lastName' },
      },
    },
    photo: { type: 'link', selector: 'photo' },
    price: { type: 'number', selector: 'price.amount' },
  },
};

const scraper = new Scraper('http://characters.com', contract);

scraper.scrapePage().then(items => {
  console.log(items);
  // [
  //   {
  //     "name": "Jon Snow",
  //     "friends": [
  //       {
  //         "firstName": "Sansa",
  //         "lastName": "Stark"
  //       },
  //       {
  //         "firstName": "Bran",
  //         "lastName": "Stark"
  //       },
  //       {
  //         "firstName": "Arya",
  //         "lastName": "Stark"
  //       }
  //     ],
  //     "photo": "http://images.com/jonsnow",
  //     "price": 12345
  //   },
  //   {
  //     "name": "Ned Stark",
  //     "friends": [
  //       {
  //         "firstName": "Sansa",
  //         "lastName": "Stark"
  //       },
  //       {
  //         "firstName": "Bobby",
  //         "lastName": "B"
  //       },
  //       {
  //         "firstName": "Little",
  //         "lastName": "finger"
  //       }
  //     ],
  //     "photo": "http://images.com/nedstark",
  //     "price": 6789
  //   }
  // ]
});
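In this mode the selectors act as property paths into the parsed JSON rather than CSS selectors, as `price.amount` in the contract above suggests. The sketch below shows one way such a contract could be walked, including nested lists. `getPath` and `applyContract` are hypothetical helpers, and type formatting (for example turning "12345 dollars" into a number) is deliberately left out.

```javascript
// Hypothetical helpers; not contract-scraper's internals.

// Resolve a dotted path such as 'price.amount' against an object.
function getPath(obj, path) {
  return path
    .split('.')
    .reduce((value, key) => (value == null ? undefined : value[key]), obj);
}

// Walk the list found at contract.itemSelector and pick each attribute.
function applyContract(data, contract) {
  const items = getPath(data, contract.itemSelector) || [];
  return items.map(item => {
    const result = {};
    for (const [name, attr] of Object.entries(contract.attributes)) {
      result[name] = attr.itemSelector
        ? applyContract(item, attr) // nested list of items
        : getPath(item, attr.selector); // raw value, no type formatting
    }
    return result;
  });
}

const data = {
  characters: [
    {
      name: 'Jon Snow',
      friends: [{ firstName: 'Sansa', lastName: 'Stark' }],
      price: { amount: '12345 dollars' },
    },
  ],
};

const jsonContract = {
  itemSelector: 'characters',
  attributes: {
    name: { type: 'text', selector: 'name' },
    friends: {
      itemSelector: 'friends',
      attributes: { firstName: { type: 'text', selector: 'firstName' } },
    },
    price: { type: 'number', selector: 'price.amount' },
  },
};

const scraped = applyContract(data, jsonContract);
console.log(JSON.stringify(scraped, null, 2));
```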

Package last updated on 22 Jul 2024
