Supercrawler - Node.js Web Crawler

Supercrawler is a Node.js web crawler. It is designed to be highly configurable and easy to use.

Supercrawler can store state information in a database, so you can start and stop crawls easily. It will automatically retry failed URLs with exponential backoff (starting at 1 hour and doubling thereafter).

When Supercrawler successfully crawls a page (which could be an image, a text document, etc.), it will fire your custom content-type handlers. Define your own custom handlers to parse pages, save data and do anything else you need.

Features

  • Link Detection. Supercrawler will parse crawled HTML documents, identify links and add them to the queue.
  • Robots Parsing. Supercrawler will request robots.txt and check the rules before crawling. It will also identify any sitemaps.
  • Sitemaps Parsing. Supercrawler will read links from XML sitemap files, and add links to the queue.
  • Concurrency Limiting. Supercrawler limits the number of requests sent out at any one time.
  • Rate Limiting. Supercrawler will add a delay between requests to avoid bombarding servers.

Step 1. Create a New Crawler

var supercrawler = require("supercrawler");

var crawler = new supercrawler.Crawler({
  // Number of milliseconds between requests.
  interval: 100
});

Step 2. Add Content Handlers

You can specify your own content handlers for all types of content or groups of content. You can target text or text/html documents easily.

The htmlLinkParser handler is included with Supercrawler. It automatically parses an HTML document, discovers links and adds them to the crawl queue. You can specify an array of allowed hostnames with the hostnames option, allowing you to easily control the scope of your crawl.

You can also specify your own handlers. Use these handlers to parse content, save files or identify links. Just return an array of links (absolute paths) from your handler, and Supercrawler will add them to the queue.

crawler.addHandler("text/html", supercrawler.handlers.htmlLinkParser({
  hostnames: ["example.com"]
}));
crawler.addHandler("text/html", function (buf, url) {
  console.log("Got page", url);
});
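
As a minimal sketch of the pattern described above, a custom handler can return an array of absolute URLs to add to the queue. The URLs below are purely illustrative; in practice you would extract them from the page body.

crawler.addHandler("text/html", function (buf, url) {
  // Parse buf yourself (with any HTML parser you prefer) and decide
  // which absolute URLs should be queued for crawling.
  var discoveredLinks = [
    "https://example.com/about",
    "https://example.com/contact"
  ];

  // Returning an array of absolute URLs adds them to the crawl queue.
  return discoveredLinks;
});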

Step 3. Start the Crawl

Insert a starting URL into the queue, and call crawler.start().

crawler.getUrlList()
  .insertIfNotExists(new supercrawler.Url("https://example.com/"))
  .then(function () {
    return crawler.start();
  });

Crawler

Each Crawler instance represents a web crawler. You can configure your crawler with the following options:

  • urlList: Custom instance of UrlList type queue. Defaults to FifoUrlList, which processes URLs in the order that they were added to the queue; once they are removed from the queue, they cannot be recrawled.
  • interval: Number of milliseconds between requests. Defaults to 1000.
  • concurrentRequestsLimit: Maximum number of concurrent requests. Defaults to 5.
  • robotsCacheTime: Number of milliseconds that robots.txt should be cached for. Defaults to 3600000 (1 hour).
  • userAgent: User agent to use for requests. Defaults to Mozilla/5.0 (compatible; supercrawler/1.0; +https://github.com/brendonboshell/supercrawler).

Example usage:

var crawler = new supercrawler.Crawler({
  interval: 1000,
  concurrentRequestsLimit: 1
});

The following methods are available:

  • getUrlList: Get the UrlList type instance.
  • getInterval: Get the interval setting.
  • getConcurrentRequestsLimit: Get the maximum number of concurrent requests.
  • getUserAgent: Get the user agent.
  • start: Start crawling.
  • stop: Stop crawling.
  • addHandler(handler): Add a handler for all content types.
  • addHandler(contentType, handler): Add a handler for a specific content type.
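
As a short sketch of these methods, the snippet below registers one handler for every content type and one for HTML only, then starts the crawl and stops it after a fixed time (the timeout is illustrative, not part of the API):

// Handler for all content types.
crawler.addHandler(function (buf, url) {
  console.log("Fetched", url);
});

// Handler for a specific content type.
crawler.addHandler("text/html", function (buf, url) {
  console.log("Fetched HTML page", url);
});

crawler.start();

// Stop crawling after five minutes.
setTimeout(function () {
  crawler.stop();
}, 5 * 60 * 1000);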

DbUrlList

DbUrlList is a queue backed with a database, such as MySQL, Postgres or SQLite. You can use any database engine supported by Sequelize.

If a request fails, this queue will ensure the request gets retried at some point in the future. The first retry is scheduled 1 hour into the future. After that, the delay doubles for each subsequent failure.
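
For illustration only (this is not the library's internal code), the retry schedule described above works out to:

// Delay before the next attempt after failureCount consecutive failures:
// 1 hour, then 2 hours, then 4 hours, and so on.
var failureCount = 3;
var retryDelayMs = 3600000 * Math.pow(2, failureCount - 1); // 4 hours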

Options:

  • opts.db.database: Database name.
  • opts.db.username: Database username.
  • opts.db.password: Database password.
  • opts.db.sequelizeOpts: Options to pass to Sequelize.

Example usage:

new supercrawler.DbUrlList({
  db: {
    database: "crawler",
    username: "root",
    password: "password",
    sequelizeOpts: {
      dialect: "mysql",
      host: "localhost"
    }
  }
})
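
To have the crawler use this database-backed queue, pass it as the urlList option when constructing the Crawler (a sketch reusing the connection settings above):

var crawler = new supercrawler.Crawler({
  urlList: new supercrawler.DbUrlList({
    db: {
      database: "crawler",
      username: "root",
      password: "password",
      sequelizeOpts: {
        dialect: "mysql",
        host: "localhost"
      }
    }
  }),
  interval: 1000
});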

The following methods are available:

  • insertIfNotExists(url): Insert a Url object.
  • upsert(url): Upsert a Url object.
  • getNextUrl(): Get the next Url to be crawled.

FifoUrlList

The FifoUrlList is the default URL queue powering the crawler. You can add URLs to the queue, and they will be crawled in the same order (FIFO).

Note that, with this queue, URLs are only crawled once, even if the request fails. If you need retry functionality, you must use DbUrlList.

The following methods are available:

  • insertIfNotExists(url): Insert a Url object.
  • upsert(url): Upsert a Url object.
  • getNextUrl(): Get the next Url to be crawled.

Url

A Url represents a URL to be crawled, or a URL that has already been crawled. It is uniquely identified by an absolute-path URL, but also contains information about errors and status codes.

  • url: Absolute-path string URL.
  • statusCode: HTTP status code or null.
  • errorCode: String error code or null.

Example usage:

var url = new supercrawler.Url({
  url: "https://example.com"
});

You can also construct it with just a string URL:

var url = new supercrawler.Url("https://example.com");

The following methods are available:

  • getUniqueId: Get the unique identifier for this object.
  • getUrl: Get the absolute-path string URL.
  • getErrorCode: Get the error code, or null if it is empty.
  • getStatusCode: Get the status code, or null if it is empty.
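
A short sketch of these getters on a Url constructed with explicit fields (the values shown are illustrative):

var url = new supercrawler.Url({
  url: "https://example.com/missing-page",
  statusCode: 404,
  errorCode: null
});

url.getUrl();        // "https://example.com/missing-page"
url.getStatusCode(); // 404
url.getErrorCode();  // null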

handlers.htmlLinkParser

A function that returns a handler which parses an HTML page and identifies any links.

  • hostnames: Array of hostnames that are allowed to be crawled.

Example usage:

var hlp = supercrawler.handlers.htmlLinkParser({
  hostnames: ["example.com"]
});

handlers.robotsParser

A function that returns a handler which parses a robots.txt file. robots.txt files are automatically crawled and sent through the same content handler routines as any other file. This handler will look for any Sitemap: directives and add those XML sitemaps to the crawl.

It will ignore any files that are not /robots.txt.

If you want to extract the URLs from those XML sitemaps, you will also need to add a sitemap parser.

Example usage:

var rp = supercrawler.handlers.robotsParser();
crawler.addHandler("text/plain", supercrawler.handlers.robotsParser());

handlers.sitemapsParser

A function that returns a handler which parses an XML sitemap file. It will pick up any URLs matching sitemapindex > sitemap > loc or urlset > url > loc.

It will also handle gzipped sitemap files, since gzip support is part of the sitemaps specification.

Example usage:

var sp = supercrawler.handlers.sitemapsParser();
crawler.addHandler(supercrawler.handlers.sitemapsParser());
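
Putting the bundled handlers together, a typical setup registers the robots, sitemap and HTML link parsers side by side (a sketch reusing the hostname from the earlier examples):

// Discover sitemaps referenced by robots.txt.
crawler.addHandler("text/plain", supercrawler.handlers.robotsParser());

// Extract URLs from XML sitemaps (gzipped or not).
crawler.addHandler(supercrawler.handlers.sitemapsParser());

// Follow links within HTML pages on the allowed hostname.
crawler.addHandler("text/html", supercrawler.handlers.htmlLinkParser({
  hostnames: ["example.com"]
}));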
