scrape-it
A Node.js scraper for humans.
:cloud: Installation
$ npm i --save scrape-it
:clipboard: Example
const scrapeIt = require("scrape-it");

// Promise interface
scrapeIt("http://ionicabizau.net", {
    title: ".header h1"
  , desc: ".header h2"
  , avatar: {
        selector: ".header img"
      , attr: "src"
    }
}).then(page => {
    console.log(page);
});

// Callback interface
scrapeIt("http://ionicabizau.net", {
    // Fetch the articles
    articles: {
        listItem: ".article"
      , data: {
            // Convert the date string into a Date object
            createdAt: {
                selector: ".date"
              , convert: x => new Date(x)
            }
          , title: "a.article-title"
            // Nested list
          , tags: {
                listItem: ".tags > span"
            }
            // Get the content as HTML
          , content: {
                selector: ".article-content"
              , how: "html"
            }
        }
    }
    // Fetch the page links
  , pages: {
        listItem: "li.page"
      , name: "pages"
      , data: {
            title: "a"
          , url: {
                selector: "a"
              , attr: "href"
            }
        }
    }
    // Fetch some other data from the page
  , title: ".header h1"
  , desc: ".header h2"
  , avatar: {
        selector: ".header img"
      , attr: "src"
    }
}, (err, page) => {
    console.log(err || page);
});
:memo: Documentation
scrapeIt(url, opts, cb)
A scraping module for humans.
Params
- String|Object url: The page url or request options.
- Object opts: The options passed to the scrapeHTML method.
- Function cb: The callback function.
Return
- Promise A promise object.
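Since scrapeIt returns a promise, it can also be consumed with async/await. The following is a minimal sketch, not part of the library: `scrapePage` and its `scrape` parameter are hypothetical names, and it assumes the promise resolves to the scraped object (as `page` does in the example above) and rejects on failure. The scraper is passed in as an argument so the helper can be tried with a stub.

```javascript
// Hypothetical helper: `scrape` is any function with the
// scrapeIt(url, opts) signature that returns a promise,
// for example require("scrape-it").
async function scrapePage(scrape, url) {
    try {
        // Same kind of schema as in the first example of this README.
        return await scrape(url, {
            title: ".header h1"
          , desc: ".header h2"
        });
    } catch (err) {
        // Assumption: failures surface as a rejected promise.
        console.error("Scraping failed:", err.message);
        return null;
    }
}

// Real usage (needs the scrape-it package and network access):
//   scrapePage(require("scrape-it"), "http://ionicabizau.net").then(console.log);
```

Returning `null` on failure is only one possible design; rethrowing the error would suit callers that want to handle it themselves.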
scrapeIt.scrapeHTML($, opts)
Scrapes the data in the provided element.
Params
- Cheerio $: The input element.
- Object opts: An object containing the scraping information.
  If you want to scrape a list, you have to use the listItem selector:
  - listItem (String): The list item selector.
  - data (Object): The fields to include in the list objects:
    - <fieldName> (Object|String): The selector or an object containing:
      - selector (String): The selector.
      - convert (Function): An optional function to change the value.
      - how (Function|String): A function or function name to access the value.
      - attr (String): If provided, the value will be taken based on the attribute name.
      - trim (Boolean): If false, the value will not be trimmed (default: true).
      - closest (String): If provided, returns the first ancestor of the given element.
      - eq (Number): If provided, it will select the nth element.
      - listItem (Object): An object, keeping the recursive schema of the listItem object. This can be used to create nested lists.
Example:
{
    articles: {
        listItem: ".article"
      , data: {
            createdAt: {
                selector: ".date"
              , convert: x => new Date(x)
            }
          , title: "a.article-title"
          , tags: {
                listItem: ".tags > span"
            }
          , content: {
                selector: ".article-content"
              , how: "html"
            }
          , traverseOtherNode: {
                selector: ".upperNode"
              , closest: "div"
              , convert: x => x.length
            }
        }
    }
}
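The options not covered by the example above (eq, trim, attr combined with convert, and a nested list with its own data) can be sketched as follows. All selectors and field names here are hypothetical. The sketch also assumes that eq is zero-based, as in Cheerio's .eq(), and that convert receives the value after attr has been applied. Neither is stated in this document.

```javascript
// Hypothetical selectors and field names, for illustration only.
const catalogSchema = {
    products: {
        listItem: ".product"
      , data: {
            // A plain string is used as the selector.
            name: "h2.name"
          , price: {
                selector: ".price"
                // convert receives the extracted text, e.g. "$1,299.50".
              , convert: text => parseFloat(text.replace(/[^0-9.]/g, ""))
            }
          , secondImage: {
                selector: "img"
              , eq: 1           // assumed zero-based, as in Cheerio's .eq()
              , attr: "src"
            }
          , note: {
                selector: ".note"
              , trim: false     // keep leading and trailing whitespace
            }
            // Nested list: every product gets its own array of reviews.
          , reviews: {
                listItem: ".review"
              , data: {
                    author: ".author"
                  , rating: { selector: ".stars", attr: "data-rating", convert: Number }
                }
            }
        }
    }
};

// The converters are ordinary functions, so they can be checked in isolation.
const fields = catalogSchema.products.data;
console.log(fields.price.convert("$1,299.50"));        // 1299.5
console.log(fields.reviews.data.rating.convert("4"));  // 4
```

Because a schema is a plain object, it can be built and inspected like this before it is handed to scrapeIt or scrapeIt.scrapeHTML.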
If you want to collect specific data from the page, just use the same schema used for the data field.
Example:
{
    title: ".header h1"
  , desc: ".header h2"
  , avatar: {
        selector: ".header img"
      , attr: "src"
    }
}
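Because page-level fields and list data fields share one schema, a field definition can be written once and reused in both places. This is a small sketch with hypothetical names (`authorFields`, `profilePageSchema`, `authorsListSchema`) and hypothetical selectors:

```javascript
// Field definitions shared between a page-level schema and a list schema.
const authorFields = {
    name: ".author-name"
  , avatar: { selector: "img.avatar", attr: "src" }
};

// Page level: the fields sit directly on the schema object.
const profilePageSchema = Object.assign({ title: ".header h1" }, authorFields);

// List level: the same fields go under `data`.
const authorsListSchema = {
    authors: { listItem: ".author", data: authorFields }
};

console.log(Object.keys(profilePageSchema)); // [ 'title', 'name', 'avatar' ]
```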
Return
- Object The scraped data.
:yum: How to contribute
Have an idea? Found a bug? See how to contribute.
:moneybag: Donations
Another way to support the development of my open-source modules is
to set up a recurring donation, via Patreon. :rocket:
PayPal donations are appreciated too! Each dollar helps.
Thanks! :heart:
:dizzy: Where is this library used?
If you are using this library in one of your projects, add it to this list. :sparkles:
- 3abn: A 3ABN radio client in the terminal.
- bandcamp-scraper (by Simon Thiboutôt): A scraper for https://bandcamp.com
- cevo-lookup (by Zack Boehm): Searches the CEVO Suspension List for bans by SteamID
- codementor: A scraper for codementor.io.
- degusta-scrapper (by yohendry hurtado): desgusta scrapper for alexa skill
- proxylist (by self_refactor): Get free proxy list
- rs-api (by Alex Kempf): Simple wrapper for RuneScape APIs written in node.
- sahibinden (by Cagatay Cali): Simple sahibinden.com bot
- sahibindenServer (by Cagatay Cali): Simple sahibinden.com bot server side
- sgdq-collector (by Benjamin Congdon): Collects Twitch / Donation information and pushes data to Firebase
- trump-cabinet-picks (by Linda Haviv): NYT cabinet predictions for Trump admin.
- ubersetzung (by self_refactor): translate words with examples from German to English
- ui-studentsearch (by Rakha Kanz Kautsar): API for majapahit.cs.ui.ac.id/studentsearch
:scroll: License
MIT © Ionică Bizău