Intro
A lightweight Node.js spider. It supports:
- Link following (followLinks)
- Custom headers
- Bloom filter
- Retry mechanism
- Proxy requests
- Routing
- Resuming from the last visited link
- Free choice of parser and storage
Usage
import { Crawler, userAgent } from 'ngrab'
import cheerio from 'cheerio'

let crawler = new Crawler({
  name: 'myCrawler',
  bloom: true, // enable the Bloom filter
  interval: () => (Math.random() * 16 + 4) * 1000, // wait 4-20 seconds between requests
  startUrls: ['https://github.com/trending'],
})
crawler.download('trending', async ({ req, res, followLinks, resolveLink }) => {
  if (!res) return // no response, nothing to parse
  let $ = cheerio.load(res.body.toString())
  let repoList: Array<{ name: string; href: string }> = []
  let $rows = $('.Box-row')
  if ($rows.length) {
    $rows.each(function () {
      let $item = $(this)
      repoList.push({
        // repository name, with whitespace collapsed
        name: $('.lh-condensed a .text-normal', $item)
          .text()
          .replace(/\s+/g, ' ')
          .trim(),
        // relative link to the repository
        href: $('.lh-condensed a', $item).attr('href') as string,
      })
    })
    console.log(repoList)
  }
})
crawler.run()
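The download context also provides followLinks and resolveLink. As a rough sketch, assuming resolveLink turns a relative href into an absolute URL and followLinks queues those URLs for crawling, the handler above could enqueue each repository it finds:

crawler.download('trending', async ({ req, res, followLinks, resolveLink }) => {
  if (!res) return
  let $ = cheerio.load(res.body.toString())
  // Collect the relative repository links found on the trending page
  let hrefs = $('.Box-row .lh-condensed a')
    .map((_, el) => $(el).attr('href'))
    .get()
    .filter((href): href is string => Boolean(href))
  // Assumption: resolveLink(href) resolves a relative href against the current
  // request URL, and followLinks(urls) schedules the resulting URLs.
  followLinks(hrefs.map((href) => resolveLink(href)))
})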
The request hook will execute before each request:
crawler.request('headers', async (context) => {
  Object.assign(context.req.headers, {
    'Cache-Control': 'no-cache',
    'User-Agent': userAgent(),
    Accept: '*/*',
    'Accept-Encoding': 'gzip, deflate, compress',
    Connection: 'keep-alive',
  })
})
Routes
Instead of parsing everything in 'crawler.download()', you can split the parsing code into different routes:
crawler.route({
  url: 'https://github.com/trending',
  async download({ req, res }) {
    // parse the trending page here
  },
})

crawler.route({
  url: 'https://github.com/*/*',
  async download({ req, res }) {
    // parse an individual repository page here
  },
})

crawler.route({
  url: 'https://github.com/*/*/issues',
  async download({ req, res }) {
    // parse a repository's issues page here
  },
})
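For instance, the https://github.com/*/* route could extract some details from a repository page with cheerio. The selectors below are assumptions about GitHub's current markup, not part of ngrab:

crawler.route({
  url: 'https://github.com/*/*',
  async download({ req, res }) {
    if (!res) return
    let $ = cheerio.load(res.body.toString())
    // Assumed selectors: the "About" blurb and topic tags of a repository page
    let about = $('.f4.my-3').text().trim()
    let topics = $('a.topic-tag')
      .map((_, el) => $(el).text().trim())
      .get()
    console.log({ about, topics })
  },
})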
Proxy
You can provide a proxy server getter when initializing the crawler:
let crawler = new Crawler({
  name: 'myCrawler',
  startUrls: ['https://github.com/trending'],
  async proxy() {
    let url = await getProxyUrlFromSomeWhere()
    return url
  },
})
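For example, the getter could rotate through a fixed list of proxy URLs (the addresses below are placeholders):

let proxies = ['http://127.0.0.1:8001', 'http://127.0.0.1:8002']
let next = 0

let proxiedCrawler = new Crawler({
  name: 'myCrawler',
  startUrls: ['https://github.com/trending'],
  // Return a different proxy URL for each request
  async proxy() {
    let url = proxies[next]
    next = (next + 1) % proxies.length
    return url
  },
})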