supercrawler - npm Package Compare versions

Comparing version 0.3.0 to 0.3.1

.travis.yml


package.json
{
"name": "supercrawler",
"description": "A web crawler. Supercrawler automatically crawls websites. Define custom handlers to parse content. Obeys robots.txt, rate limits and concurrency limits.",
"version": "0.3.0",
"version": "0.3.1",
"homepage": "https://github.com/brendonboshell/supercrawler",

@@ -6,0 +6,0 @@ "author": "Brendon Boshell <brendonboshell@gmail.com>",

@@ -1,9 +0,14 @@

# Supercrawler - Node.js Web Crawler
# Node.js Web Crawler
[![npm](https://img.shields.io/npm/v/supercrawler.svg?maxAge=2592000)]()
[![npm](https://img.shields.io/npm/l/supercrawler.svg?maxAge=2592000)]()
[![GitHub issues](https://img.shields.io/github/issues/brendonboshell/supercrawler.svg?maxAge=2592000)]()
[![David](https://img.shields.io/david/brendonboshell/supercrawler.svg?maxAge=2592000)]()
[![David](https://img.shields.io/david/dev/brendonboshell/supercrawler.svg?maxAge=2592000)]()
[![Travis](https://img.shields.io/travis/brendonboshell/supercrawler.svg?maxAge=2592000)]()
Supercrawler is a Node.js web crawler. It is designed to be highly configurable and easy to use.
Supercrawler can store state information in a database, so you can start and stop crawls easily. It will automatically retry failed URLs with exponential backoff (starting at 1 hour and doubling thereafter).
When Supercrawler successfully crawls a page (which could be an image, a text document or any other file), it will fire your custom content-type handlers. Define your own custom handlers to parse pages, save data and do anything else you need.
When Supercrawler successfully crawls a page (which could be an image, text, etc.), it will fire your custom content-type handlers. Define your own custom handlers to parse pages, save data and do anything else you need.
## Features

@@ -21,40 +26,94 @@

bombarding servers.
* **Exponential Backoff Retry**. Supercrawler will retry failed requests after 1 hour, then 2 hours, then 4 hours, etc. To use this feature, you must use the database-backed crawl queue.
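For reference, the retry schedule described above works out as in the sketch below. This is only an illustration of the doubling delays, not code taken from the library; the `retryDelayMs` helper is hypothetical.

```js
// Illustration only: the retry schedule described above
// (1 hour after the first failure, doubling on each subsequent one).
// This is not Supercrawler's internal code; retryDelayMs is hypothetical.
function retryDelayMs(attempt) {
  var oneHourMs = 60 * 60 * 1000;
  return oneHourMs * Math.pow(2, attempt - 1); // attempt 1 -> 1h, 2 -> 2h, 3 -> 4h
}

[1, 2, 3, 4].forEach(function (attempt) {
  console.log("Failure", attempt, "retries after", retryDelayMs(attempt) / 3600000, "hour(s)");
});
```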
## Step 1. Create a New Crawler
## How It Works
var crawler = new supercrawler.Crawler({
interval: 100
});
**Crawling** is controlled by an instance of the `Crawler` object, which acts like a web client. It is responsible for coordinating with the *priority queue*, sending requests according to the concurrency and rate limits, checking the robots.txt rules and dispatching content to the custom *content handlers* to be processed. Once started, it will automatically crawl pages until you ask it to stop.
## Step 2. Add Content handlers
The **Priority Queue** or **UrlList** keeps track of which URLs need to be crawled, and the order in which they are to be crawled. The Crawler will pass new URLs discovered by the content handlers to the priority queue. When the crawler is ready to crawl the next page, it will call the `getNextUrl` method. This method will work out which URL should be crawled next, based on implementation-specific rules. Any retry logic is handled by the queue.
You can specify your own content handlers for all types of content or groups
of content. You can target `text` or `text/html` documents easily.
The **Content Handlers** are functions which take content buffers and do some further processing with them. You will almost certainly want to create your own content handlers to analyze pages or store data, for example. The content handlers tell the Crawler about new URLs that should be crawled in the future. Supercrawler provides content handlers to parse links from HTML pages, analyze robots.txt files for `Sitemap:` directives and parse sitemap files for URLs.
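To make the division of labour concrete, here is a purely conceptual sketch of a single crawl iteration. It is not the library's implementation: `fetchPage`, `handler.matches` and `handler.fn` are hypothetical stand-ins, and only `getNextUrl` and `insertIfNotExists` are names taken from the description above.

```js
// Conceptual sketch only -- not Supercrawler's actual code.
// The crawler asks the queue for the next URL, fetches it, passes the
// response body to every matching content handler, and inserts any
// links returned by the handlers back into the priority queue.
function crawlOnce(urlList, handlers, fetchPage) {
  return urlList.getNextUrl().then(function (url) {
    return fetchPage(url).then(function (response) {
      var discovered = [];
      handlers.forEach(function (handler) {
        if (handler.matches(response.contentType)) {
          discovered = discovered.concat(handler.fn(response.body, url) || []);
        }
      });
      return Promise.all(discovered.map(function (link) {
        return urlList.insertIfNotExists(link);
      }));
    });
  });
}
```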
The `htmlLinkParser` handler is included with Supercrawler. It automatically
parses an HTML document, discovers links and adds them to the crawl queue. You
can specify an array of allowed hostnames with the `hostnames` option, allowing
you to easily control the scope of your crawl.
## Get Started
You can also specify your own handlers. Use these handlers to parse content,
save files or identify links. Just return an array of links (absolute paths)
from your handler, and Supercrawler will add them to the queue.
First, install Supercrawler.
crawler.addHandler("text/html", supercrawler.handlers.htmlLinkParser({
hostnames: ["example.com"]
}));
crawler.addHandler("text/html", function (buf, url) {
console.log("Got page", url);
});
```
npm install supercrawler --save
```
## Step 3. Start the Crawl
Second, create an instance of `Crawler`.
Insert a starting URL into the queue, and call `crawler.start()`.
```js
var supercrawler = require("supercrawler");
crawler.getUrlList()
.insertIfNotExists(new supercrawler.Url("https://example.com/"))
.then(function () {
return crawler.start();
});
// 1. Create a new instance of the Crawler
// object, providing configuration details.
// Note that configuration cannot be changed
// after the object is created.
var crawler = new supercrawler.Crawler({
// By default, Supercrawler uses a simple
// FIFO queue, which doesn't support
// retries or memory of crawl state. For
// any non-trivial crawl, you should
// create a database. Provide your database
// config to the constructor of DbUrlList.
urlList: new supercrawler.DbUrlList({
db: {
database: "crawler",
username: "root",
password: secrets.db.password,
sequelizeOpts: {
dialect: "mysql",
host: "localhost"
}
}
}),
// Time (ms) between requests
interval: 1000,
// Maximum number of requests at any
// one time.
concurrentRequestsLimit: 5,
// Time (ms) to cache the results
// of robots.txt queries.
robotsCacheTime: 3600000,
// User agent string to use during the crawl.
userAgent: "Mozilla/5.0 (compatible; supercrawler/1.0; +https://github.com/brendonboshell/supercrawler)"
});
```
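If you just want to try Supercrawler out before setting up a database, a minimal configuration along the following lines should be enough; omitting `urlList` falls back to the simple in-memory FIFO queue mentioned in the comments above (no retries, no persisted state). The option values here are arbitrary.

```js
// Minimal sketch: rely on the default in-memory FIFO queue.
// Suitable for a quick trial crawl, not for anything long-running.
var crawler = new supercrawler.Crawler({
  interval: 1000,             // ms between requests
  concurrentRequestsLimit: 2  // maximum simultaneous requests
});
```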
Third, add some content handlers.
```js
// Get "Sitemaps:" directives from robots.txt
crawler.addHandler(supercrawler.handlers.robotsParser());
// Crawl sitemap files and extract their URLs.
crawler.addHandler(supercrawler.handlers.sitemapsParser());
// Pick up <a href> links from HTML documents
crawler.addHandler("text/html", supercrawler.handlers.htmlLinkParser({
// Restrict discovered links to the following hostnames.
hostnames: ["example.com"]
}));
// Custom content handler for HTML pages.
crawler.addHandler("text/html", function (buf, url) {
var sizeKb = Buffer.byteLength(buf) / 1024;
logger.info("Processed", url, "Size=", sizeKb, "KB");
});
```
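Handlers can also feed the queue directly. As noted earlier in this README, returning an array of absolute links from a handler causes Supercrawler to add them to the crawl queue; the sketch below relies on that behaviour and uses a deliberately naive regular expression purely for illustration.

```js
// Sketch of a handler that also enqueues new URLs. Returning an array
// of absolute links adds them to the crawl queue (see the note above).
// The regex extraction is a stand-in for real parsing logic.
crawler.addHandler("text/html", function (buf, url) {
  var html = buf.toString("utf8");
  var links = [];
  var re = /href="(https?:\/\/example\.com\/[^"]*)"/g;
  var match;
  while ((match = re.exec(html)) !== null) {
    links.push(match[1]);
  }
  return links;
});
```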
Fourth, add a URL to the queue and start the crawl.
```js
crawler.getUrlList()
.insertIfNotExists(new supercrawler.Url("http://example.com/"))
.then(function () {
return crawler.start();
});
```
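You are not limited to a single seed URL. A sketch of seeding several URLs, and later stopping the crawl, might look like the following. Note that `stop()` is an assumption based on the statement above that the crawl runs until you ask it to stop; everything else uses the same calls as the example just shown.

```js
// Sketch: seed several URLs, start, and stop after a while.
var seeds = [
  "http://example.com/",
  "http://example.com/blog/"
];

Promise.all(seeds.map(function (u) {
  return crawler.getUrlList().insertIfNotExists(new supercrawler.Url(u));
})).then(function () {
  return crawler.start();
});

// stop() is assumed as the counterpart to start(), per the text above.
setTimeout(function () {
  crawler.stop();
}, 60 * 1000);
```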
That's it! Supercrawler will handle the crawling for you. You only have to define your custom behaviour in the content handlers.
## Crawler

@@ -212,3 +271,3 @@

# handlers.sitemapsParser
## handlers.sitemapsParser

@@ -215,0 +274,0 @@ A function that returns a handler which parses an XML sitemaps file. It will
