# reqscraper

Lightweight wrapper for request JS and x-ray JS.
## Sample Usage
This module wraps request JS for making HTTP requests and x-ray for easily scraping websites, exposed as `req` and `scrape` respectively. Both return promises. `req` has an internal control structure that retries a request up to 5 times as a failsafe.
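The retry failsafe can be pictured with a short sketch. `withRetry` below is illustrative only, not the module's actual implementation; it stands in for the internal control structure described above:

```js
// Illustrative only: retry a promise-returning call up to `retries` times
// (default 5), mirroring the failsafe behaviour described above.
function withRetry(fn, retries) {
  retries = retries === undefined ? 5 : retries;
  return fn().catch(function (err) {
    if (retries <= 1) throw err;
    return withRetry(fn, retries - 1);
  });
}

// Example: a flaky operation that succeeds on the third attempt.
var attempts = 0;
function flaky() {
  attempts += 1;
  return attempts < 3
    ? Promise.reject(new Error('fail'))
    : Promise.resolve('ok');
}

withRetry(flaky).then(function (res) {
  console.log(res, 'after', attempts, 'attempts'); // ok after 3 attempts
});
```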
## Brief API doc
- `req(options)`, where `options` is a request options object. See request JS for full detail.
- `scrape(dyn, url, scope, selector)`, where `dyn` is a boolean to use dynamic scraping via x-ray-phantom; `url` is the page url; `scope` and `selector` are HTML selectors. See x-ray for full detail.
- `scrapeCrawl(dyn, url, selector, tailArr, [limit])`, where `dyn` is true for dynamic scraping via x-ray-phantom; `url` is the base page url; `selector` is the selector for the base page; `tailArr` is an array of selectors for each level to crawl; `[limit]` is an optional integer to limit the number of children crawled at every level.
### req(options)

Convenient wrapper for request JS: an HTTP request method that returns a promise.

param | desc
---|---
options | A request options object. See request JS for full detail.
```js
var scraper = require('reqscraper');
var req = scraper.req;

var options = {
  method: 'GET',
  url: 'https://www.google.com',
  headers: {
    'Accept': 'application/json',
    'Authorization': 'some_auth_details'
  }
};

return req(options)
.then(console.log)
.catch(console.log)
```
### scrape(dyn, url, scope, selector)

Scraper that returns a promise. Backed by x-ray.

param | desc
---|---
dyn | boolean: whether to use dynamic scraping via x-ray-phantom
url | the page url to scrape
[scope] | optional scope to narrow down the target HTML for the selector
selector | HTML selector. See x-ray for full detail.
```js
var scraper = require('reqscraper');
var scrape = scraper.scrape;

// scrape the page body
return scrape(false, 'https://www.google.com', 'body')
.then(console.log)

// scrape all <li> elements within the body
return scrape(false, 'https://www.google.com', 'body', ['li'])
.then(console.log)
```
### scrapeCrawl(dyn, url, selector, tailArr, [limit])

An extension of `scrape` above with crawling capability. Returns a promise with results in a tree-like JSON structure. Crawls in a breadth-first tree structure, and does not crawl deeper if the root of a branch is not crawlable.

param | desc
---|---
dyn | boolean: whether to use dynamic scraping via x-ray-phantom
url | the base page url to scrape and crawl from
selector | the selector for the base page (first level)
tailArr | an array of selectors for each level to crawl. Note that a preceding selector must specify the urls to crawl via `hrefs`.
[limit] | an optional integer to limit the number of children crawled at every level
```js
var scraper = require('reqscraper');
var scrapeCrawl = scraper.scrapeCrawl;
var dc = scrapeCrawl.bind(null, true);  // dynamic scraping
var sc = scrapeCrawl.bind(null, false); // static scraping

var selector0 = {
  img: ['.dribbble-img'],
  h1: ['h1'],
  hrefs: ['.next_page@href']
};
var selector1 = {
  h1: ['h1'],
  hrefs: ['.next_page@href']
};
var selector2 = {
  h1: ['h1']
};

// crawl 4 levels deep
sc(
  'https://dribbble.com',
  selector0,
  [selector1, selector1, selector1, selector2]
)
.then(function (res) {
  console.log(JSON.stringify(res, null, 2));
});

// same crawl, limited to 3 children per level
sc(
  'https://dribbble.com',
  selector0,
  [selector1, selector1, selector1, selector2],
  3
)
.then(function (res) {
  console.log(JSON.stringify(res, null, 2));
});
```
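The effect of `[limit]` on the crawl tree can be shown with a toy illustration. The link map and `crawlTree` helper below are made up for demonstration and are not part of the module:

```js
// Toy link map: page 'a' links to 'b', 'c', 'd'; 'b' links to 'e';
// 'c' has no links, so it is not crawled any deeper.
var links = {
  a: ['b', 'c', 'd'],
  b: ['e'],
  c: []
};

// Build the tree-like result, keeping at most `limit` children per level,
// as scrapeCrawl's [limit] parameter does.
function crawlTree(map, url, depth, limit) {
  var node = { url: url };
  var hrefs = (map[url] || []).slice(0, limit); // cap children at this level
  if (depth > 0 && hrefs.length) {
    node.children = hrefs.map(function (h) {
      return crawlTree(map, h, depth - 1, limit);
    });
  }
  return node;
}

// With limit = 2, 'd' is dropped; 'b' yields child 'e'; 'c' has no children.
console.log(JSON.stringify(crawlTree(links, 'a', 2, 2), null, 2));
```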
## Changelog

**Aug 18 2015**

- Added `scrapeCrawl`, a scraper extended from `scrape` that can also crawl.
- Updated README for better API doc.