Node.js Web Crawler
Supercrawler is a Node.js web crawler. It is designed to be highly configurable and easy to use.
When Supercrawler successfully crawls a page (which could be an image, a text document or any other file), it will fire your custom content-type handlers. Define your own custom handlers to parse pages, save data and do anything else you need.
Features
- Link Detection. Supercrawler will parse crawled HTML documents, identify
links and add them to the queue.
- Robots Parsing. Supercrawler will request robots.txt and check the rules
before crawling. It will also identify any sitemaps.
- Sitemaps Parsing. Supercrawler will read links from XML sitemap files,
and add links to the queue.
- Concurrency Limiting. Supercrawler limits the number of requests sent out
at any one time.
- Rate Limiting. Supercrawler will add a delay between requests to avoid
bombarding servers.
- Exponential Backoff Retry. Supercrawler will retry failed requests after 1 hour, then 2 hours, then 4 hours, etc. To use this feature, you must use the database-backed crawl queue.
How It Works
Crawling is controlled by an instance of the Crawler object, which acts like a web client. It is responsible for coordinating with the priority queue, sending requests according to the concurrency and rate limits, checking the robots.txt rules and dispatching content to the custom content handlers to be processed. Once started, it will automatically crawl pages until you ask it to stop.
The Priority Queue or UrlList keeps track of which URLs need to be crawled, and the order in which they are to be crawled. The Crawler will pass new URLs discovered by the content handlers to the priority queue. When the crawler is ready to crawl the next page, it will call the getNextUrl method. This method will work out which URL should be crawled next, based on implementation-specific rules. Any retry logic is handled by the queue.
The Content Handlers are functions which take content buffers and do some further processing with them. You will almost certainly want to create your own content handlers to analyze pages or store data, for example. The content handlers tell the Crawler about new URLs that should be crawled in the future. Supercrawler provides content handlers to parse links from HTML pages, analyze robots.txt files for Sitemap: directives and parse sitemap files for URLs.
Get Started
First, install Supercrawler.
npm install supercrawler --save
Second, create an instance of Crawler.
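var supercrawler = require("supercrawler");
var crawler = new supercrawler.Crawler({
  // Database-backed queue, which enables retries with exponential backoff.
  urlList: new supercrawler.DbUrlList({
    db: {
      database: "crawler",
      username: "root",
      // `secrets` is assumed to be your own configuration object.
      password: secrets.db.password,
      sequelizeOpts: {
        dialect: "mysql",
        host: "localhost"
      }
    }
  }),
  // Milliseconds between requests.
  interval: 1000,
  // Maximum number of concurrent requests.
  concurrentRequestsLimit: 5,
  // Cache robots.txt for 1 hour (in milliseconds).
  robotsCacheTime: 3600000,
  userAgent: "Mozilla/5.0 (compatible; supercrawler/1.0; +https://github.com/brendonboshell/supercrawler)"
});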
Third, add some content handlers.
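// Parse robots.txt files for Sitemap: directives.
crawler.addHandler(supercrawler.handlers.robotsParser());
// Parse XML sitemaps for URLs.
crawler.addHandler(supercrawler.handlers.sitemapsParser());
// Follow links in HTML pages, restricted to the listed hostnames.
crawler.addHandler("text/html", supercrawler.handlers.htmlLinkParser({
  hostnames: ["example.com"]
}));
// Custom handler: log the size of each HTML page.
crawler.addHandler("text/html", function (buf, url) {
  var sizeKb = Buffer.byteLength(buf) / 1024;
  console.log("Processed", url, "Size=", sizeKb, "KB");
});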
Fourth, add a URL to the queue and start the crawl.
crawler.getUrlList()
  .insertIfNotExists(new supercrawler.Url("http://example.com/"))
  .then(function () {
    return crawler.start();
  });
That's it! Supercrawler will handle the crawling for you. You only have to define your custom behaviour in the content handlers.
Crawler
Each Crawler instance represents a web crawler. You can configure your crawler with the following options:
Option | Description |
--- | --- |
urlList | Custom instance of UrlList type queue. Defaults to FifoUrlList, which processes URLs in the order that they were added to the queue; once they are removed from the queue, they cannot be recrawled. |
interval | Number of milliseconds between requests. Defaults to 1000. |
concurrentRequestsLimit | Maximum number of concurrent requests. Defaults to 5. |
robotsCacheTime | Number of milliseconds that robots.txt should be cached for. Defaults to 3600000 (1 hour). |
userAgent | User agent to use for requests. Defaults to Mozilla/5.0 (compatible; supercrawler/1.0; +https://github.com/brendonboshell/supercrawler) |
Example usage:
var crawler = new supercrawler.Crawler({
  interval: 1000,
  concurrentRequestsLimit: 1
});
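To illustrate how the interval and concurrentRequestsLimit options interact, here is a minimal sketch of a gate that decides whether another request may be sent. The function canStartRequest and the state fields activeRequests and lastRequestTime are hypothetical names chosen for this example; this is not Supercrawler's internal code.

```javascript
// Hypothetical helper, not part of Supercrawler's API.
// state.activeRequests: number of requests currently in flight.
// state.lastRequestTime: timestamp (ms) of the last request sent, or null.
function canStartRequest(state, now, opts) {
  // Respect the cap on simultaneous requests.
  const underLimit = state.activeRequests < opts.concurrentRequestsLimit;
  // Respect the minimum delay between consecutive requests.
  const intervalElapsed = state.lastRequestTime === null ||
    now - state.lastRequestTime >= opts.interval;
  return underLimit && intervalElapsed;
}
```

Under these assumptions, a request goes out only when both conditions hold, so a low concurrentRequestsLimit can hold the crawl back even after the interval has elapsed.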
The following methods are available:
Method | Description |
--- | --- |
getUrlList | Get the UrlList type instance. |
getInterval | Get the interval setting. |
getConcurrentRequestsLimit | Get the maximum number of concurrent requests. |
getUserAgent | Get the user agent. |
start | Start crawling. |
stop | Stop crawling. |
addHandler(handler) | Add a handler for all content types. |
addHandler(contentType, handler) | Add a handler for a specific content type. |
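The two addHandler forms can be pictured as a small registry in which a handler added without a content type applies to every document. The sketch below is an illustration only: HandlerRegistry and handlersFor are hypothetical names, and the exact-match comparison is an assumption about how content types might be compared.

```javascript
// Hypothetical registry, not Supercrawler's implementation.
class HandlerRegistry {
  constructor() {
    this.entries = [];
  }

  // addHandler(handler) or addHandler(contentType, handler)
  addHandler(contentType, handler) {
    if (typeof contentType === "function") {
      // Single-argument form: the handler applies to all content types.
      handler = contentType;
      contentType = "*";
    }
    this.entries.push({ contentType: contentType, handler: handler });
  }

  // Return the handlers that should run for a given content type,
  // in the order they were added.
  handlersFor(contentType) {
    return this.entries
      .filter(function (entry) {
        return entry.contentType === "*" || entry.contentType === contentType;
      })
      .map(function (entry) {
        return entry.handler;
      });
  }
}
```

Each selected handler would then be called with the content buffer, the URL and the content type, as described in the 0.3.0 changelog entry.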
DbUrlList
DbUrlList is a queue backed with a database, such as MySQL, Postgres or SQLite. You can use any database engine supported by Sequelize.
If a request fails, this queue will ensure the request gets retried at some point in the future. The next request is scheduled 1 hour into the future. After that, the period of delay doubles for each failure.
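The retry schedule described above (1 hour, then 2 hours, then 4 hours, and so on) can be written as a short calculation. This is a minimal sketch of the arithmetic only; retryDelayMs and nextRetryDate are hypothetical helper names, not functions exposed by DbUrlList.

```javascript
// Hypothetical helpers, not part of Supercrawler's API.
const HOUR_MS = 60 * 60 * 1000;

// Delay before the next attempt after `numErrors` consecutive failures.
// 1 failure -> 1 hour, 2 failures -> 2 hours, 3 failures -> 4 hours, ...
function retryDelayMs(numErrors) {
  if (!Number.isInteger(numErrors) || numErrors < 1) {
    throw new RangeError("numErrors must be a positive integer");
  }
  return HOUR_MS * Math.pow(2, numErrors - 1);
}

// Earliest time at which the URL should be requested again.
function nextRetryDate(lastAttempt, numErrors) {
  return new Date(lastAttempt.getTime() + retryDelayMs(numErrors));
}
```

For example, retryDelayMs(3) gives 14400000 milliseconds, which is the 4 hour delay after the third consecutive failure.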
Options:
Option | Description |
--- | --- |
opts.db.database | Database name. |
opts.db.username | Database username. |
opts.db.password | Database password. |
opts.db.sequelizeOpts | Options to pass to sequelize. |
Example usage:
new supercrawler.DbUrlList({
  db: {
    database: "crawler",
    username: "root",
    password: "password",
    sequelizeOpts: {
      dialect: "mysql",
      host: "localhost"
    }
  }
})
The following methods are available:
Method | Description |
--- | --- |
insertIfNotExists(url) | Insert a Url object. |
upsert(url) | Upsert Url object. |
getNextUrl() | Get the next Url to be crawled. |
FifoUrlList
The FifoUrlList is the default URL queue powering the crawler. You can add URLs to the queue, and they will be crawled in the same order (FIFO).
Note that, with this queue, URLs are only crawled once, even if the request fails. If you need retry functionality, you must use DbUrlList.
The following methods are available:
Method | Description |
--- | --- |
insertIfNotExists(url) | Insert a Url object. |
upsert(url) | Upsert Url object. |
getNextUrl() | Get the next Url to be crawled. |
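The first-in, first-out, crawl-once behaviour can be sketched with an array and a set. SimpleFifoList below is a hypothetical, simplified stand-in: it stores plain strings rather than Url objects, it is synchronous (the Get Started example chains .then on insertIfNotExists, so the real methods are promise-based), and returning null from an empty queue is an assumption made for this example.

```javascript
// Hypothetical, simplified stand-in for an in-memory FIFO URL list.
class SimpleFifoList {
  constructor() {
    this.queue = [];       // URLs waiting to be crawled, oldest first
    this.seen = new Set(); // every URL ever inserted, so nothing is queued twice
  }

  // Add the URL only if it has never been inserted before.
  // Returns true if the URL was queued, false if it was already known.
  insertIfNotExists(url) {
    if (this.seen.has(url)) {
      return false;
    }
    this.seen.add(url);
    this.queue.push(url);
    return true;
  }

  // Remove and return the oldest URL, or null when the queue is empty.
  getNextUrl() {
    return this.queue.length > 0 ? this.queue.shift() : null;
  }
}
```

Because a removed URL stays in the seen set, inserting it again has no effect, which mirrors the crawl-once behaviour described above.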
Url
A Url represents a URL to be crawled, or a URL that has already been crawled. It is uniquely identified by an absolute-path URL, but also contains information about errors and status codes.
Option | Description |
--- | --- |
url | Absolute-path string url |
statusCode | HTTP status code or null. |
errorCode | String error code or null. |
Example usage:
var url = new supercrawler.Url({
  url: "https://example.com"
});
You can also call it with just a string URL:
var url = new supercrawler.Url("https://example.com");
The following methods are available:
Method | Description |
--- | --- |
getUniqueId | Get the unique identifier for this object. |
getUrl | Get the absolute-path string URL. |
getErrorCode | Get the error code, or null if it is empty. |
getStatusCode | Get the status code, or null if it is empty. |
handlers.htmlLinkParser
A function that returns a handler which parses an HTML page and identifies any links.
Option | Description |
--- | --- |
hostnames | Array of hostnames that are allowed to be crawled. |
Example usage:
var hlp = supercrawler.handlers.htmlLinkParser({
  hostnames: ["example.com"]
});
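To show what restricting the crawl to a list of hostnames might involve, here is a minimal sketch that resolves links against the page URL and keeps only those on allowed hosts. filterLinksByHostname is a hypothetical function written for this example with Node's built-in URL class; it is not the parser's actual implementation, and the restriction to http and https links is an assumption.

```javascript
// Hypothetical filter, not the library's implementation.
function filterLinksByHostname(hrefs, pageUrl, hostnames) {
  const allowed = new Set(hostnames);
  const result = [];
  for (const href of hrefs) {
    let resolved;
    try {
      // Resolve relative links against the URL of the crawled page.
      resolved = new URL(href, pageUrl);
    } catch (err) {
      continue; // skip links that cannot be parsed
    }
    const isWeb = resolved.protocol === "http:" || resolved.protocol === "https:";
    if (isWeb && allowed.has(resolved.hostname)) {
      result.push(resolved.href);
    }
  }
  return result;
}
```

In this sketch a relative link such as /about is kept when the page itself is on an allowed host, while links to other hosts and mailto: links are dropped.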
handlers.robotsParser
A function that returns a handler which parses a robots.txt file. Robots.txt files are automatically crawled, and sent through the same content handler routines as any other file. This handler will look for any Sitemap: directives, and add those XML sitemaps to the crawl.
It will ignore any files that are not /robots.txt.
If you want to extract the URLs from those XML sitemaps, you will also need
to add a sitemap parser.
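As an illustration of the behaviour described above, the sketch below checks whether a URL points at /robots.txt and collects the targets of any Sitemap: directives. isRobotsTxt and extractSitemapUrls are hypothetical names, and the line-by-line regular expression is a simplification rather than the handler's real parsing logic.

```javascript
// Hypothetical helpers, not the handler's actual implementation.

// True only when the URL's path is exactly /robots.txt.
function isRobotsTxt(url) {
  return new URL(url).pathname === "/robots.txt";
}

// Collect the URL given by each "Sitemap:" line (case-insensitive).
function extractSitemapUrls(robotsTxt) {
  const urls = [];
  for (const line of robotsTxt.split(/\r?\n/)) {
    const match = /^\s*sitemap\s*:\s*(\S+)/i.exec(line);
    if (match) {
      urls.push(match[1]);
    }
  }
  return urls;
}
```

Lines that begin with a comment marker do not match the pattern, so a commented-out Sitemap: line is skipped in this sketch.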
Example usage:
var rp = supercrawler.handlers.robotsParser();
crawler.addHandler("text/plain", rp);
handlers.sitemapsParser
A function that returns a handler which parses an XML sitemaps file. It will pick up any URLs matching sitemapindex > sitemap > loc, urlset > url > loc.
It will also handle a gzipped file, since that is part of the sitemaps specification.
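The sketch below shows one way the gzip handling and URL extraction could fit together, using Node's built-in zlib module. It is a simplified illustration and not the parser's real code: isGzipped and extractSitemapLocs are hypothetical names, the regular expression picks up every loc element without checking its parent elements, and XML entities such as &amp;amp; are not decoded.

```javascript
const zlib = require("zlib");

// Hypothetical helpers, not the handler's actual implementation.

// Gzip streams start with the magic bytes 0x1f 0x8b.
function isGzipped(buf) {
  return buf.length >= 2 && buf[0] === 0x1f && buf[1] === 0x8b;
}

// Return the text of every <loc> element in a sitemap buffer.
function extractSitemapLocs(buf) {
  const raw = isGzipped(buf) ? zlib.gunzipSync(buf) : buf;
  const xml = raw.toString("utf8");
  const pattern = /<loc>\s*([^<]+?)\s*<\/loc>/g;
  const locs = [];
  let match;
  while ((match = pattern.exec(xml)) !== null) {
    locs.push(match[1]);
  }
  return locs;
}
```

A production parser would more likely use a real XML parser so that only the sitemapindex > sitemap > loc and urlset > url > loc paths are matched.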
Example usage:
var sp = supercrawler.handlers.sitemapsParser();
crawler.addHandler(sp);
Changelog
0.4.0
- [Changed] Supercrawler no longer follows redirects on crawled URLs. Supercrawler will now add a redirected URL to the queue as a separate entry. We still follow redirects for the /robots.txt that is used for checking rules, but not for a /robots.txt added to the queue.
0.3.3
- [Fix] DbUrlList now marks a URL as taken, and ensures it never returns a URL that is being crawled in another concurrent request. This has required a new field called holdDate on the url table.
0.3.2
- [Fix] Time-based unit tests made more reliable.
0.3.1
- [Added] Support for Travis CI.
0.3.0
- [Added] Content type passed as third argument to all content type handlers.
- [Added] Sitemaps parser to extract sitemap URLs and urlset URLs.
- [Changed] Content handlers receive Buffers rather than strings for the first argument.
- [Fix] Robots.txt checking to work for the first crawled URL. There was a bug that caused robots.txt to be ignored if it wasn't in the cache.
0.2.3
- [Added] A robots.txt parser that identifies Sitemap: directives.
0.2.2
- [Fixed] Support for URLs up to 10,000 characters long. This required a new urlHash SHA1 field on the url table, to support the unique index.
0.2.1
- [Added] Extensive documentation.
0.2.0
- [Added] Status code is updated in the queue for successfully crawled pages (HTTP code < 400).
- [Added] A new error type error.RequestError for all errors that occur when requesting a page.
- [Added] DbUrlList queue object that stores URLs in a SQL database. Includes exponential backoff retry logic.
- [Changed] Interface to DbUrlList and FifoUrlList is now via methods insertIfNotExists, upsert and getNextUrl. Previously, it was just insert (which also updated) and upsert, but we need a way to distinguish newly discovered URLs, which should not update the crawl state.
0.1.0
- [Added] Crawler object, supporting rate limiting, concurrent requests limiting, robots.txt caching.
- [Added] FifoUrlList object, a first-in, first-out in-memory list of URLs to be crawled.
- [Added] Url object, representing a URL in the crawl queue.
- [Added] htmlLinkParser, a function to extract links from crawled HTML documents.