supercrawler
A web crawler. Supercrawler automatically crawls websites. Define custom handlers to parse content. Obeys robots.txt, rate limits and concurrency limits.
Supercrawler is a Node.js web crawler. It is designed to be highly configurable and easy to use.
When Supercrawler successfully crawls a page (which could be an image, a text document or any other file), it will fire your custom content-type handlers. Define your own custom handlers to parse pages, save data and do anything else you need.
Crawling is controlled by an instance of the Crawler object, which acts like a web client. It is responsible for coordinating with the priority queue, sending requests according to the concurrency and rate limits, checking the robots.txt rules and dispatching content to the custom content handlers to be processed. Once started, it will automatically crawl pages until you ask it to stop.
The Priority Queue or UrlList keeps track of which URLs need to be crawled, and the order in which they are to be crawled. The Crawler will pass new URLs discovered by the content handlers to the priority queue. When the crawler is ready to crawl the next page, it will call the getNextUrl method. This method will work out which URL should be crawled next, based on implementation-specific rules. Any retry logic is handled by the queue.
The Content Handlers are functions which take content buffers and do some further processing with them. You will almost certainly want to create your own content handlers to analyze pages or store data, for example. The content handlers tell the Crawler about new URLs that should be crawled in the future. Supercrawler provides content handlers to parse links from HTML pages, analyze robots.txt files for Sitemap: directives and parse sitemap files for URLs.
First, install Supercrawler.
npm install supercrawler --save
Second, create an instance of Crawler.
var supercrawler = require("supercrawler");
// 1. Create a new instance of the Crawler object, providing configuration
// details. Note that configuration cannot be changed after the object is
// created.
var crawler = new supercrawler.Crawler({
  // By default, Supercrawler uses a simple FIFO queue, which doesn't support
  // retries or memory of crawl state. For any non-trivial crawl, you should
  // create a database. Provide your database config to the constructor of
  // DbUrlList.
  urlList: new supercrawler.DbUrlList({
    db: {
      database: "crawler",
      username: "root",
      // `secrets` is assumed to be your own configuration object.
      password: secrets.db.password,
      sequelizeOpts: {
        dialect: "mysql",
        host: "localhost"
      }
    }
  }),
  // Time (ms) between requests
  interval: 1000,
  // Maximum number of requests at any one time.
  concurrentRequestsLimit: 5,
  // Time (ms) to cache the results of robots.txt queries.
  robotsCacheTime: 3600000,
  // User agent string to use during the crawl.
  userAgent: "Mozilla/5.0 (compatible; supercrawler/1.0; +https://github.com/brendonboshell/supercrawler)",
  // Custom options to be passed to request.
  request: {
    headers: {
      'x-custom-header': 'example'
    }
  }
});
Third, add some content handlers.
// Get "Sitemap:" directives from robots.txt
crawler.addHandler(supercrawler.handlers.robotsParser());
// Crawl sitemap files and extract their URLs.
crawler.addHandler(supercrawler.handlers.sitemapsParser());
// Pick up <a href> links from HTML documents
crawler.addHandler("text/html", supercrawler.handlers.htmlLinkParser({
  // Restrict discovered links to the following hostnames.
  hostnames: ["example.com"]
}));
// Match an array of content types. myCustomHandler is assumed to be a
// handler function that you have defined elsewhere.
crawler.addHandler(["text/plain", "text/html"], myCustomHandler);
// Custom content handler for HTML pages.
crawler.addHandler("text/html", function (context) {
  var sizeKb = Buffer.byteLength(context.body) / 1024;
  console.log("Processed", context.url, "Size=", sizeKb, "KB");
});
Fourth, add a URL to the queue and start the crawl.
crawler.getUrlList()
  .insertIfNotExists(new supercrawler.Url("http://example.com/"))
  .then(function () {
    return crawler.start();
  });
That's it! Supercrawler will handle the crawling for you. You only have to define your custom behaviour in the content handlers.
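As a rough sketch of what a custom content handler might look like, the function below scans a plain-text body for absolute URLs. It assumes, as in the handler example above, that a handler receives a `context` object with `body` (a Buffer) and `url`. It also assumes that a handler reports newly discovered URLs by returning an array of URL strings, which this document does not spell out. `plainTextLinkHandler` is a hypothetical name, not part of Supercrawler.

```javascript
// Hypothetical handler: collects absolute http(s) URLs from a text body.
// Assumes context.body is a Buffer, as in the handler example above.
function plainTextLinkHandler(context) {
  const text = context.body.toString("utf8");
  // Match runs of non-space characters that start with http:// or https://
  const matches = text.match(/https?:\/\/[^\s<>"']+/g) || [];
  // Remove duplicates while keeping first-seen order.
  return Array.from(new Set(matches));
}

// Stand-in context so the sketch can be run without starting a crawl.
const discovered = plainTextLinkHandler({
  url: "http://example.com/notes.txt",
  body: Buffer.from("See http://example.com/a and http://example.com/b then http://example.com/a again")
});
console.log(discovered);
```

With a real crawler, a handler like this would be registered for one content type, for example `crawler.addHandler("text/plain", plainTextLinkHandler)`.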
Each Crawler instance represents a web crawler. You can configure your crawler with the following options:
Option | Description |
---|---|
urlList | Custom instance of UrlList type queue. Defaults to FifoUrlList, which processes URLs in the order that they were added to the queue; once they are removed from the queue, they cannot be recrawled. |
interval | Number of milliseconds between requests. Defaults to 1000. |
concurrentRequestsLimit | Maximum number of concurrent requests. Defaults to 5. |
robotsEnabled | Indicates if the robots.txt is downloaded and checked. Defaults to true. |
robotsCacheTime | Number of milliseconds that robots.txt should be cached for. Defaults to 3600000 (1 hour). |
robotsIgnoreServerError | Indicates if a 500 status code response for robots.txt should be ignored. Defaults to false. |
userAgent | User agent to use for requests. This can be either a string or a function that takes the URL being crawled. Defaults to Mozilla/5.0 (compatible; supercrawler/1.0; +https://github.com/brendonboshell/supercrawler). |
request | Object of options to be passed to request. Note that request does not support an asynchronous (and distributed) cookie jar. |
Example usage:
var crawler = new supercrawler.Crawler({
  interval: 1000,
  concurrentRequestsLimit: 1
});
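Since the userAgent option can also be a function that takes the URL being crawled, a per-host policy could be sketched as below. This assumes the function receives the URL as an absolute string, which the option table does not state. `userAgentFor`, the staging hostname and the alternative user agent string are all hypothetical.

```javascript
// The default user agent documented in the options table.
const DEFAULT_USER_AGENT =
  "Mozilla/5.0 (compatible; supercrawler/1.0; +https://github.com/brendonboshell/supercrawler)";

// Hypothetical policy: identify differently on one internal host.
// Assumes `url` is an absolute URL string.
function userAgentFor(url) {
  const hostname = new URL(url).hostname;
  if (hostname === "staging.example.com") {
    return "my-internal-crawler/1.0";
  }
  return DEFAULT_USER_AGENT;
}

console.log(userAgentFor("https://staging.example.com/page"));
console.log(userAgentFor("http://example.com/"));
```

Under that assumption, the function would be passed as `userAgent: userAgentFor` in the Crawler options.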
The following methods are available:
Method | Description |
---|---|
getUrlList | Get the UrlList type instance. |
getInterval | Get the interval setting. |
getConcurrentRequestsLimit | Get the maximum number of concurrent requests. |
getUserAgent | Get the user agent. |
start | Start crawling. |
stop | Stop crawling. |
addHandler(handler) | Add a handler for all content types. |
addHandler(contentType, handler) | Add a handler for a specific content type. If contentType is a string, then (for example) 'text' will match 'text/html', 'text/plain', etc. If contentType is an array of strings, the page content type must match exactly. |
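The matching rule described for addHandler can be illustrated with a small standalone function. This is only a sketch of the rule as the table states it, not Supercrawler's internal code; `matchesContentType` is a hypothetical name, and treating a missing content type as "match everything" is an assumption based on the addHandler(handler) form.

```javascript
// Illustrative matcher for the rule in the table above.
function matchesContentType(spec, contentType) {
  if (spec === undefined) {
    // Assumption: a handler added without a content type sees every page.
    return true;
  }
  if (Array.isArray(spec)) {
    // Array form: the page content type must equal one entry exactly.
    return spec.includes(contentType);
  }
  // String form: match the whole type, or a leading segment before "/".
  return contentType === spec || contentType.startsWith(spec + "/");
}

console.log(matchesContentType("text", "text/html"));
console.log(matchesContentType(["text"], "text/html"));
console.log(matchesContentType(["text/plain", "text/html"], "text/html"));
```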
The Crawler object fires the following events:
Event | Description |
---|---|
crawlurl(url) | Fires when crawling starts with a new URL. |
crawledurl(url, errorCode, statusCode, errorMessage) | Fires when crawling of a URL is complete. errorCode is null if no error occurred. statusCode is set if and only if the request was successful. errorMessage is null if no error occurred. |
urllistempty | Fires when the URL list is (intermittently) empty. |
urllistcomplete | Fires when the URL list is permanently empty, barring URLs added by external sources. This only makes sense when running Supercrawler in non-distributed fashion. |
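A minimal sketch of listening for these events follows. It assumes the Crawler exposes an `.on(name, listener)` method in the style of Node's EventEmitter, which this document does not state explicitly. To stay runnable without a crawl, it uses a plain EventEmitter as a stand-in. `attachCrawlLogging` and the `"HTTP_ERROR"` code are hypothetical.

```javascript
const { EventEmitter } = require("events");

// Hypothetical helper: records crawl progress from the events listed above.
function attachCrawlLogging(crawler, lines) {
  crawler.on("crawlurl", function (url) {
    lines.push("start " + url);
  });
  crawler.on("crawledurl", function (url, errorCode, statusCode, errorMessage) {
    if (errorCode === null) {
      lines.push("done " + url + " status=" + statusCode);
    } else {
      lines.push("failed " + url + " " + errorCode + ": " + errorMessage);
    }
  });
  crawler.on("urllistcomplete", function () {
    lines.push("queue complete");
  });
}

// Stand-in emitter; with Supercrawler you would pass the Crawler instance.
const fakeCrawler = new EventEmitter();
const progress = [];
attachCrawlLogging(fakeCrawler, progress);
fakeCrawler.emit("crawlurl", "http://example.com/");
fakeCrawler.emit("crawledurl", "http://example.com/", null, 200, null);
fakeCrawler.emit("crawledurl", "http://example.com/missing", "HTTP_ERROR", null, "Not Found");
fakeCrawler.emit("urllistcomplete");
console.log(progress.join("\n"));
```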
DbUrlList is a queue backed with a database, such as MySQL, Postgres or SQLite. You can use any database engine supported by Sequelize.
If a request fails, this queue will ensure the request gets retried at some point in the future. The next request is scheduled 1 hour into the future. After that, the period of delay doubles for each failure.
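The retry schedule described above (1 hour after the first failure, doubling with each further failure) works out as in the following sketch. `retryDelayMs` is a hypothetical helper written for illustration; it is not a function exported by Supercrawler.

```javascript
const ONE_HOUR_MS = 60 * 60 * 1000;

// Hypothetical helper: delay before the next attempt after `failures`
// consecutive failures (failures >= 1): 1 hour, then 2, 4, 8, ...
function retryDelayMs(failures) {
  return ONE_HOUR_MS * Math.pow(2, failures - 1);
}

for (let failures = 1; failures <= 4; failures++) {
  console.log(failures + " failure(s): retry in " + retryDelayMs(failures) / ONE_HOUR_MS + " hour(s)");
}
```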
Options:
Option | Description |
---|---|
opts.db.database | Database name. |
opts.db.username | Database username. |
opts.db.password | Database password. |
opts.db.sequelizeOpts | Options to pass to sequelize. |
opts.db.table | Table name to store URL queue. Default = 'url' |
opts.recrawlInMs | Number of milliseconds to recrawl a URL. Default = 31536000000 (1 year) |
Example usage:
new supercrawler.DbUrlList({
  db: {
    database: "crawler",
    username: "root",
    password: "password",
    sequelizeOpts: {
      dialect: "mysql",
      host: "localhost"
    }
  }
})
The following methods are available:
Method | Description |
---|---|
insertIfNotExists(url) | Insert a Url object. |
upsert(url) | Upsert Url object. |
getNextUrl() | Get the next Url to be crawled. |
RedisUrlList is a queue backed with Redis.
If a request fails, this queue will ensure the request gets retried at some point in the future. The next request is scheduled 1 hour into the future. After that, the period of delay doubles for each failure.
It also balances requests between different hostnames. So, for example, if you crawl a sitemap file with 10,000 URLs, the next 10,000 URLs will not be stuck in the same host.
Options:
Option | Description |
---|---|
opts.redis | Options passed to ioredis. |
opts.delayHalfLifeMs | Hostname delay factor half-life. Requests are delayed by an amount of time proportional to the number of pages crawled for a hostname, but this factor exponentially decays over time. Default = 3600000 (1 hour). |
opts.expiryTimeMs | Amount of time before recrawling a successful URL. Default = 2592000000 (30 days). |
opts.initialRetryTimeMs | Amount of time to wait before first retry after a failed URL. Default = 3600000 (1 hour) |
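To give a feel for what a half-life means for opts.delayHalfLifeMs, the sketch below applies plain exponential decay to a per-hostname factor. The exact formula RedisUrlList uses is not given in this document, so this only illustrates the general idea. `decayedFactor` is a hypothetical helper.

```javascript
// Illustrative only: exponential decay with a half-life. After one
// half-life the factor is halved, after two it is quartered, and so on.
function decayedFactor(factor, elapsedMs, halfLifeMs) {
  return factor * Math.pow(0.5, elapsedMs / halfLifeMs);
}

const HALF_LIFE_MS = 3600000; // the documented default (1 hour)
console.log(decayedFactor(8, 0, HALF_LIFE_MS));
console.log(decayedFactor(8, HALF_LIFE_MS, HALF_LIFE_MS));
console.log(decayedFactor(8, 2 * HALF_LIFE_MS, HALF_LIFE_MS));
```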
Example usage:
new supercrawler.RedisUrlList({
  redis: {
    host: "127.0.0.1"
  }
})
The following methods are available:
Method | Description |
---|---|
insertIfNotExists(url) | Insert a Url object. |
upsert(url) | Upsert Url object. |
getNextUrl() | Get the next Url to be crawled. |
The FifoUrlList is the default URL queue powering the crawler. You can add URLs to the queue, and they will be crawled in the same order (FIFO).
Note that, with this queue, URLs are only crawled once, even if the request fails. If you need retry functionality, you must use DbUrlList or RedisUrlList.
The following methods are available:
Method | Description |
---|---|
insertIfNotExists(url) | Insert a Url object. |
upsert(url) | Upsert Url object. |
getNextUrl() | Get the next Url to be crawled. |
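The three methods shared by the URL lists can be sketched as a tiny in-memory FIFO queue. This is an illustration under assumptions, not Supercrawler's implementation: it assumes each method returns a promise (as the insertIfNotExists(...).then(...) call earlier suggests), that queued items expose getUniqueId(), and that an empty list is signalled by rejecting with a RangeError (the changelog mentions a RangeError when the URL list is empty). `SketchFifoList` and `fakeUrl` are hypothetical names.

```javascript
// Minimal in-memory sketch of a FIFO URL list.
function SketchFifoList() {
  this._items = [];            // URLs in insertion order
  this._indexById = new Map(); // unique id -> position in _items
  this._next = 0;              // position of the next URL to hand out
}

SketchFifoList.prototype.insertIfNotExists = function (url) {
  const id = url.getUniqueId();
  if (!this._indexById.has(id)) {
    this._indexById.set(id, this._items.length);
    this._items.push(url);
  }
  return Promise.resolve(url);
};

SketchFifoList.prototype.upsert = function (url) {
  const id = url.getUniqueId();
  if (this._indexById.has(id)) {
    // Known URL: overwrite the stored record in place.
    this._items[this._indexById.get(id)] = url;
    return Promise.resolve(url);
  }
  return this.insertIfNotExists(url);
};

SketchFifoList.prototype.getNextUrl = function () {
  if (this._next >= this._items.length) {
    // Assumption: an exhausted list is signalled with a RangeError.
    return Promise.reject(new RangeError("The list has been exhausted."));
  }
  return Promise.resolve(this._items[this._next++]);
};

// Stand-in for a Url object, so the sketch runs on its own.
function fakeUrl(u) {
  return { getUniqueId: function () { return u; }, getUrl: function () { return u; } };
}

const list = new SketchFifoList();
const demo = list.insertIfNotExists(fakeUrl("http://example.com/a"))
  .then(function () { return list.insertIfNotExists(fakeUrl("http://example.com/a")); }) // duplicate, ignored
  .then(function () { return list.insertIfNotExists(fakeUrl("http://example.com/b")); })
  .then(function () { return list.getNextUrl(); })
  .then(function (first) {
    console.log(first.getUrl());
    return first.getUrl();
  });
```

Once a URL has been handed out by getNextUrl in this sketch it is never returned again, which mirrors the "crawled only once" behaviour described for FifoUrlList.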
A Url represents a URL to be crawled, or a URL that has already been crawled. It is uniquely identified by an absolute-path URL, but also contains information about errors and status codes.
Option | Description |
---|---|
url | Absolute-path string URL. |
statusCode | HTTP status code or null. |
errorCode | String error code or null. |
Example usage:
var url = new supercrawler.Url({
  url: "https://example.com"
});
You can also call it with just a string URL:
var url = new supercrawler.Url("https://example.com");
The following methods are available:
Method | Description |
---|---|
getUniqueId | Get the unique identifier for this object. |
getUrl | Get the absolute-path string URL. |
getErrorCode | Get the error code, or null if it is empty. |
getStatusCode | Get the status code, or null if it is empty. |
htmlLinkParser is a function that returns a handler which parses an HTML page and identifies any links.
Option | Description |
---|---|
hostnames | Array of hostnames that are allowed to be crawled. |
urlFilter(url, pageUrl) | Function that takes a URL and returns true if it should be included. |
Example usage:
var hlp = supercrawler.handlers.htmlLinkParser({
  hostnames: ["example.com"]
});
var hlp = supercrawler.handlers.htmlLinkParser({
  urlFilter: function (url) {
    return url.indexOf("page1") === -1;
  }
});
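The urlFilter option is listed as taking both the discovered URL and the URL of the page it was found on. A filter that uses the second argument might look like the sketch below, which keeps only links that stay on the same host as the page. It assumes both arguments arrive as absolute URL strings. `sameHostFilter` is a hypothetical name.

```javascript
// Hypothetical filter: keep a link only if it stays on the host of the
// page it was found on. Assumes both arguments are absolute URL strings.
function sameHostFilter(url, pageUrl) {
  try {
    return new URL(url).hostname === new URL(pageUrl).hostname;
  } catch (err) {
    // Not a parseable absolute URL: leave it out.
    return false;
  }
}

console.log(sameHostFilter("https://example.com/page2", "https://example.com/"));
console.log(sameHostFilter("https://other.example/", "https://example.com/"));
```

Such a function would be passed as `urlFilter: sameHostFilter` in the htmlLinkParser options.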
robotsParser is a function that returns a handler which parses a robots.txt file. Robots.txt files are automatically crawled, and sent through the same content handler routines as any other file. This handler will look for any Sitemap: directives, and add those XML sitemaps to the crawl.
It will ignore any files that are not /robots.txt.
If you want to extract the URLs from those XML sitemaps, you will also need to add a sitemap parser.
Option | Description |
---|---|
urlFilter(sitemapUrl, robotsTxtUrl) | Function that takes a URL and returns true if it should be included. |
Example usage:
var rp = supercrawler.handlers.robotsParser();
crawler.addHandler("text/plain", supercrawler.handlers.robotsParser());
sitemapsParser is a function that returns a handler which parses an XML sitemaps file. It will pick up any URLs matching sitemapindex > sitemap > loc, urlset > url > loc.
It will also handle a gzipped file, since that is part of the sitemaps specification.
Option | Description |
---|---|
urlFilter | Function that takes a URL (including sitemap entries) and returns true if it should be included. |
Example usage:
var sp = supercrawler.handlers.sitemapsParser();
crawler.addHandler(supercrawler.handlers.sitemapsParser());
crawledurl event to contain the error message, thanks hjr3.
sitemapsParser to apply urlFilter on the sitemaps entries, thanks hjr3.
Crawler to take userAgent option as a function, thanks hjr3.
Crawler#addHandler can now take an array of content-type to match, thanks taina0407.
opts.db.table option to DbUrlList (adversinc).
recrawlInMs option to DbUrlList (adversinc).
urlFilter option to htmlLinkParser (adversinc).
robotsEnabled (default true) option to allow the robots.txt check to be disabled (cbess).
robotsIgnoreServerError option to accept a robots.txt 500 error code as "allow all" rather than "deny all" (default), thanks cbess.
htmlLinkParser should detect links matching the area[href] selector.
crawledurl event when the crawl of a specific URL is complete (whether successful or not).
urllistcomplete event when the UrlList is permanently empty (compare with urllistempty, which may fire intermittently).
request library.
gzipContentTypes option to sitemapsParser. Example: gzipContentTypes: 'application/gzip' and gzipContentTypes: ['application/gzip'].
redirect, links and httpError events.
### 0.13.1
DbUrlList doesn't fetch the existing record from the database unless there was an error.
errorMessage column on urls table that gives more information about, e.g., a handlers error that occurred.
context argument. This allows you to pass information forwards via handlers. For example, you might cache the cheerio parsing so you don't parse with every content handler.
handlersError is emitted if any of the handlers returns an error.
urlHash field to 40 characters, in case tables are using utf8mb4 collations for strings.
getNextUrl function of DbUrlList to use a more optimized query.
DbUrlList).
Accept-Encoding: gzip, deflate header, so the responses arrive compressed (saving data transfer).
robotsParser function.
sitemapsParser function.
<xhtml:link rel="alternate"> URLs, in addition to the <loc> URLs.
insertIfNotExistsBulk method which can insert a large list of URLs into the crawl queue.
DbUrlList supports the bulk insert method.
application/gzip as well as application/x-gzip.
urllistempty and crawlurl events. It also captures the RangeError event when the URL list is empty.
htmlLinkParser now also picks up link tags where rel=alternate.
/robots.txt that is used for checking rules; but not for /robots.txt added to the queue.
DbUrlList to mark a URL as taken, and ensure it never returns a URL that is being crawled in another concurrent request. This has required a new field called holdDate on the url table.
### 0.3.2
Sitemap: directives.
urlHash SHA1 field on the url table, to support the unique index.
error.RequestError for all errors that occur when requesting a page.
DbUrlList queue object that stores URLs in a SQL database. Includes exponential backoff retry logic.
DbUrlList and FifoUrlList is now via methods insertIfNotExists, upsert and getNextUrl. Previously, it was just insert (which also updated) and upsert, but we need a way to differentiate between discovered URLs which should not update the crawl state.
Crawler object, supporting rate limiting, concurrent requests limiting, robots.txt caching.
FifoUrlList object, a first-in, first-out in-memory list of URLs to be crawled.
Url object, representing a URL in the crawl queue.
htmlLinkParser, a function to extract links from crawled HTML documents.