
contract-scraper
With contract-scraper you can easily scrape an HTML page and return the data in a structured format.
npm install contract-scraper --save
yarn add contract-scraper
To scrape a page, create a new instance of contract-scraper with these parameters:
import Scraper from 'contract-scraper';

const contract = {
  itemSelector: 'li',
  puppeteer: true,
  attributes: {
    name: {
      type: 'text',
      selector: '.name',
    },
    link: {
      type: 'link',
      selector: 'a',
      attribute: 'href',
    },
  },
};

// Optional Puppeteer launch options
const puppeteerOptions = {
  headless: false,
};

const scraper = new Scraper('http://website.com', contract, puppeteerOptions);
A scraper can be initialised with custom puppeteer launch options.
A contract accepts the following properties:
itemSelector
(string) - A CSS selector for the elements to be scraped. The scraper processes every element matching this selector.
puppeteer
(boolean) - If set to true, contract-scraper uses Puppeteer to load and scrape the page contents.
waitForPageLoadSelector
(string) - Puppeteer waits for this CSS selector to exist in the DOM before scraping the page. Must be used in conjunction with puppeteer: true.
attributes
(object) - Defines the data to scrape for each item.
Each attribute matches an HTML element to scrape. The attribute type defines how data is extracted from the element and how it is formatted in the final output. For example, you can use one of the in-built types to extract a number from an element:
<ul>
  <li>
    <div class="name">Iron man</div>
    <div class="price">100 euros</div>
  </li>
  <li>
    <div class="name">Captain America</div>
    <div class="price">500 euros</div>
  </li>
</ul>
const contract = {
  itemSelector: 'li',
  attributes: {
    name: {
      type: 'text',
      selector: '.name',
    },
    price: {
      type: 'number',
      selector: '.price',
    },
  },
};

const scraper = new Scraper('http://characters.com', contract);

scraper.scrapePage().then(items => {
  console.log(items);
  // [
  //   {
  //     name: 'Iron man',
  //     price: 100
  //   },
  //   {
  //     name: 'Captain America',
  //     price: 500
  //   }
  // ]
});
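To make the effect of the number type more concrete, here is a minimal sketch of how text such as "100 euros" could be reduced to a number. The helper name extractNumber and the regular expression are assumptions made for illustration; the library's own parsing may differ.

```javascript
// Hypothetical helper, not part of contract-scraper's API.
// Returns the first numeric run found in the text, or null if there is none.
function extractNumber(text) {
  const match = String(text).match(/-?\d+(?:\.\d+)?/);
  return match ? Number(match[0]) : null;
}

const prices = ['100 euros', '500 euros'].map(extractNumber);
console.log(prices); // [ 100, 500 ]
```

Returning null rather than NaN when no digits are present is a design choice in this sketch, so that missing values are easy to filter out later.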
Each attribute can have the following properties:
name
(string) - A label for this attribute for the final output
selector
(string) - The CSS selector for the element (scoped to itemSelector).
type
(string) - A custom type, or one of the in-built ones, which return:
background-image: A background-image URL from a style string
link: An absolute URL
number: A number
size: A number for size in m²
text: Inner text of the element
attribute (optional)
(string) - The name of the HTML attribute to scrape data from. E.g. for an element:
<a href="http://linktoscrape">Homepage</a>
{
  name: 'URL',
  type: 'link',
  selector: 'a',
  attribute: 'href'
}
By default the attribute type will use the innerText of the element if attribute is not specified.
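The fallback described here can be sketched in a few lines. The code below uses a plain object as a stand-in for a DOM element, and the helper names readRawValue and toAbsoluteUrl are hypothetical. It only illustrates the rule "use the named attribute if given, otherwise innerText", plus one way a link type could produce an absolute URL with the standard URL class.

```javascript
// Hypothetical helpers, not part of contract-scraper's API.
// `element` is a plain stand-in for a DOM element.
function readRawValue(element, attributeName) {
  if (attributeName) {
    const value = element.attributes[attributeName];
    return value === undefined ? null : value;
  }
  // No attribute requested: fall back to the element's text.
  return element.innerText;
}

// Resolve a possibly relative href against the page URL (WHATWG URL API).
function toAbsoluteUrl(href, pageUrl) {
  return new URL(href, pageUrl).href;
}

const anchor = { innerText: 'Homepage', attributes: { href: '/home' } };

const fromAttribute = readRawValue(anchor, 'href'); // '/home'
const fromText = readRawValue(anchor);              // 'Homepage'
const absolute = toAbsoluteUrl(fromAttribute, 'http://linktoscrape');

console.log(fromAttribute, fromText, absolute);
```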
data (optional)
(object) - If you want to scrape HTML data attributes you can do it in two ways:
<div data-country="Australia"></div>
{
  name: 'Country',
  type: 'text',
  selector: 'data-country',
  data: { name: 'country' }
}
This will return "Australia" in your list of results.
<div data-price="{currency: 'aud'}"></div>
{
  name: 'Price',
  type: 'number',
  selector: 'data-price',
  data: { name: 'price', key: 'currency' }
}
This will return "aud" in your list of results.
It's also possible to scrape nested attributes, like a list inside an item:
<ul class="friends">
  <li>
    <span>Spiderman</span>
    <ul>
      <li><strong>Iron</strong><em>Man</em></li>
      <li><strong>Captain</strong><em>America</em></li>
    </ul>
  </li>
</ul>
The contract:
{
  "itemSelector": ".friends li",
  "attributes": {
    "name": { "type": "text", "selector": "span" },
    "friends": {
      "itemSelector": "ul li",
      "attributes": {
        "firstName": { "type": "text", "selector": "strong" },
        "lastName": { "type": "text", "selector": "em" }
      }
    }
  }
}
This will return all the friends as an array (using any type):
[
  {
    name: 'Spiderman',
    friends: [
      { firstName: 'Iron', lastName: 'Man' },
      { firstName: 'Captain', lastName: 'America' },
    ],
  },
];
In addition to the in-built attribute types, you can provide your own when you create a new instance of the scraper. A custom attribute type needs to be a class or a function that has a value property. As a constructor argument it receives the innerText string of the matching element, which you can then format however you like.
For example, if you wanted to extract a list of tags and format them as an array:
<ul>
  <li>
    <div class="name">Australia</div>
    <div class="tags">spiders,vegemite,scorching,heat</div>
  </li>
</ul>
import Scraper from 'contract-scraper';

const contract = {
  itemSelector: 'li',
  attributes: {
    countryName: {
      type: 'text',
      selector: '.name',
    },
    tags: {
      type: 'list',
      selector: '.tags',
    },
  },
};

function ListFromString(commaSeparatedString) {
  return commaSeparatedString.split(',');
}

const scraper = new Scraper('http://countries.com', contract, {
  list: ListFromString,
});

scraper.scrapePage().then(items => {
  console.log(items);
  // [
  //   {
  //     countryName: 'Australia',
  //     tags: [ 'spiders', 'vegemite', 'scorching', 'heat' ]
  //   }
  // ]
});
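The description above also allows a custom type to be a class with a value property. A minimal sketch of that form follows. The class name TagList is hypothetical, and reading the result through new TagList(text).value is an assumption based on the description rather than on the library's source.

```javascript
// Hypothetical custom attribute type written as a class.
// It receives the element's innerText in the constructor and
// exposes the formatted result through a `value` getter.
class TagList {
  constructor(innerText) {
    this.innerText = innerText;
  }

  get value() {
    return this.innerText
      .split(',')
      .map(tag => tag.trim())
      .filter(Boolean); // drop empty entries such as a trailing comma
  }
}

const tags = new TagList('spiders, vegemite,scorching,heat,').value;
console.log(tags); // [ 'spiders', 'vegemite', 'scorching', 'heat' ]
```

Assuming the scraper accepts classes the same way it accepts functions, TagList would be registered under a type name (for example list) in the same object that holds ListFromString.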
Sometimes you may want to extract values from inside a script tag on the page. For the moment, contract-scraper only supports parsing JSON. For example:
<html>
  <head>
    <title>Page with a script tag</title>
  </head>
  <body>
    <script type="application/ld+json" id="info">
      {
        "characters": [
          {
            "name": "Jon Snow",
            "friends": [
              { "firstName": "Sansa", "lastName": "Stark" },
              { "firstName": "Bran", "lastName": "Stark" },
              { "firstName": "Arya", "lastName": "Stark" }
            ],
            "photo": "http://images.com/jonsnow",
            "price": {
              "amount": "12345 dollars"
            }
          },
          {
            "name": "Ned Stark",
            "friends": [
              { "firstName": "Sansa", "lastName": "Stark" },
              { "firstName": "Bobby", "lastName": "B" },
              { "firstName": "Little", "lastName": "finger" }
            ],
            "photo": "http://images.com/nedstark",
            "price": {
              "amount": "6789 euros"
            }
          }
        ]
      }
    </script>
  </body>
</html>
const contract = {
  scriptTagSelector: '#info',
  itemSelector: 'characters',
  attributes: {
    name: { type: 'text', selector: 'name' },
    friends: {
      itemSelector: 'friends',
      attributes: {
        firstName: { type: 'text', selector: 'firstName' },
        lastName: { type: 'text', selector: 'lastName' },
      },
    },
    photo: { type: 'link', selector: 'photo' },
    price: { type: 'number', selector: 'price.amount' },
  },
};

const scraper = new Scraper('http://characters.com', contract);

scraper.scrapePage().then(items => {
  console.log(items);
  // [
  //   {
  //     "name": "Jon Snow",
  //     "friends": [
  //       {
  //         "firstName": "Sansa",
  //         "lastName": "Stark"
  //       },
  //       {
  //         "firstName": "Bran",
  //         "lastName": "Stark"
  //       },
  //       {
  //         "firstName": "Arya",
  //         "lastName": "Stark"
  //       }
  //     ],
  //     "photo": "http://images.com/jonsnow",
  //     "price": 12345
  //   },
  //   {
  //     "name": "Ned Stark",
  //     "friends": [
  //       {
  //         "firstName": "Sansa",
  //         "lastName": "Stark"
  //       },
  //       {
  //         "firstName": "Bobby",
  //         "lastName": "B"
  //       },
  //       {
  //         "firstName": "Little",
  //         "lastName": "finger"
  //       }
  //     ],
  //     "photo": "http://images.com/nedstark",
  //     "price": 6789
  //   }
  // ]
});
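In this example the selectors act as key paths into the parsed JSON rather than CSS selectors, and nested contracts recurse into nested arrays. The sketch below shows one way such a lookup could work on already-parsed data. The names getPath, converters and applyContract are hypothetical, and this is not the library's implementation.

```javascript
// Hypothetical sketch: apply a contract to parsed JSON data.

// Follow a dot-separated path such as 'price.amount' into an object.
function getPath(obj, path) {
  return path
    .split('.')
    .reduce((current, key) => (current == null ? undefined : current[key]), obj);
}

// Simplified stand-ins for the in-built attribute types.
const converters = {
  text: value => String(value),
  link: value => String(value),
  number: value => {
    const match = String(value).match(/-?\d+(?:\.\d+)?/);
    return match ? Number(match[0]) : null;
  },
};

function applyContract(data, contract) {
  const items = getPath(data, contract.itemSelector) || [];
  return items.map(item => {
    const result = {};
    for (const [name, spec] of Object.entries(contract.attributes)) {
      // A nested contract has its own itemSelector, so recurse into it.
      result[name] = spec.itemSelector
        ? applyContract(item, spec)
        : converters[spec.type](getPath(item, spec.selector));
    }
    return result;
  });
}

const data = {
  characters: [
    {
      name: 'Jon Snow',
      friends: [{ firstName: 'Sansa', lastName: 'Stark' }],
      price: { amount: '12345 dollars' },
    },
  ],
};

const jsonContract = {
  itemSelector: 'characters',
  attributes: {
    name: { type: 'text', selector: 'name' },
    friends: {
      itemSelector: 'friends',
      attributes: {
        firstName: { type: 'text', selector: 'firstName' },
        lastName: { type: 'text', selector: 'lastName' },
      },
    },
    price: { type: 'number', selector: 'price.amount' },
  },
};

const items = applyContract(data, jsonContract);
console.log(JSON.stringify(items));
```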
FAQs
A customisable data scraper for the web based on JSON contracts
We found that contract-scraper demonstrated an unhealthy version release cadence and project activity because the last version was released a year ago. It has 0 open source maintainers collaborating on the project.