# Web Crawler with Redis Graph
![downloads](https://img.shields.io/npm/dm/redis-web-crawler.svg?style=flat-square)
Read the blog post.
Web Crawler built with Node.js. Fetch site data from a given URL and recursively follow links across the web.
Search the sites with either breadth-first search or depth-first search.
Every URL is saved to a graph (stored as an adjacency list), and the graph is persisted in Redis.
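The package manages this storage internally, but conceptually the adjacency list maps each crawled URL to a Redis set of its outbound links. Here is a minimal sketch of that layout using the node-redis v4 client (key names and URLs are illustrative, not the package's actual schema):

```js
import { createClient } from 'redis';

const client = createClient(); // defaults to redis://127.0.0.1:6379
await client.connect();

// Each crawled page is a key; its outbound links are the set members.
await client.sAdd('https://en.wikipedia.org/wiki/Main_Page', [
  'https://en.wikipedia.org/wiki/Wikipedia',
  'https://en.wikipedia.org/wiki/Free_content',
]);

// Reading a node's neighbors back out of the graph:
const neighbors = await client.sMembers('https://en.wikipedia.org/wiki/Main_Page');
console.log(neighbors); // -> the two links added above

await client.quit();
```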
## Installation
```sh
npm install --save redis-web-crawler
```
## Usage
Run a local Redis server to store the output:

```sh
$ redis-server
```
Create a new crawler instance and pass in a configuration object. Call the `run` method to begin crawling.
```js
import WebCrawler from 'redis-web-crawler';

const crawlerSettings = {
  startUrl: 'https://en.wikipedia.org/wiki/Main_Page',
  followInternalLinks: false,
  searchDepthLimit: null,
  searchAlgorithm: 'breadthFirstSearch',
};

const crawler = new WebCrawler(crawlerSettings);
crawler.run();
```
## Configuration Properties
| Name | Type | Description |
|---|---|---|
| `startUrl` | string | A valid URL of a page with links. |
| `followInternalLinks` | boolean | Toggle searching through internal site links. |
| `searchDepthLimit` | integer | Set a limit on the depth of recursive URL requests. |
| `searchAlgorithm` | string | `"breadthFirstSearch"` or `"depthFirstSearch"`. |
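For example, a depth-limited depth-first crawl might be configured like this (the values are illustrative, and this sketch assumes `searchDepthLimit` counts link-hops from the start URL):

```js
import WebCrawler from 'redis-web-crawler';

// Illustrative settings: follow internal links, depth-first, at most 3 hops deep.
const dfsCrawler = new WebCrawler({
  startUrl: 'https://en.wikipedia.org/wiki/Main_Page',
  followInternalLinks: true,
  searchDepthLimit: 3,
  searchAlgorithm: 'depthFirstSearch',
});

dfsCrawler.run();
```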
## Exporting the Redis Graph
- Clone the redis-dump repo.
- Run the commands to install its gem dependencies (refer to the redis-dump README).
- With the Redis server up and running, note the host and port of the redis-server (e.g. 6371).
- In the project root folder, run:

  ```sh
  ./bin/redis-dump -u 127.0.0.1:6371 > db_full.json
  ```

- View the Redis export in `db_full.json`.
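If you'd rather not install the Ruby gem, a rough equivalent can be written against the Node `redis` client. This is a sketch, not part of the package; it assumes node-redis v4 and that every key in the database is a set of neighbor URLs, as in the adjacency-list layout described above:

```js
import { createClient } from 'redis';
import { writeFileSync } from 'fs';

const client = createClient({ url: 'redis://127.0.0.1:6371' });
await client.connect();

const graph = {};
// SCAN iterates keys incrementally, unlike KEYS *, which can block the server.
// (node-redis v4: scanIterator yields one key at a time.)
for await (const key of client.scanIterator()) {
  graph[key] = await client.sMembers(key); // assumes each key is a set
}

writeFileSync('db_full.json', JSON.stringify(graph, null, 2));
await client.quit();
```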
spencerlepine.com · GitHub @spencerlepine · Twitter @spencerlepine