Huge News!Announcing our $40M Series B led by Abstract Ventures.Learn More
Socket
Sign inDemoInstall
Socket

pauk

Package Overview
Dependencies
Maintainers
1
Versions
2
Alerts
File Explorer

Advanced tools

Socket logo

Install Socket

Detect and block malicious and high-risk dependencies

Install

pauk

web crawler

  • 0.0.2
  • latest
  • Source
  • npm
  • Socket score

Version published
Weekly downloads
2
decreased by-33.33%
Maintainers
1
Weekly downloads
 
Created
Source

pauk

Basic Node.js web crawler.

Install

$ npm install pauk

Example:

var Pauk = require('pauk'),
    pauk = new Pauk({maxRequests: 10});

pauk.onFinish = function(cache) {
    cache.forEach(function(v, uri) {
        if (v.error) console.log("Error:\n" + v.error);
	    else console.log("Links:\n" + v.links.join("\n"), "\nCrawled Parents:\n" + v.parents.join("\n") + "\n");
    });
};

pauk.crawl('http://example.com');

Test

$ grunt

API

new Pauk(config)

Constructor. Configuration defaults:

www: true, // if true, www.example.com will resolve to example.com
maxRequests: 5, // maximum number of total requests
ignoreQuery: true // if true, query part of the URI will be ignored

Object passed to the constructor will be merged to default configuration object.

Example:

var pauk = new Pauk({maxRequests: 55});

crawl(uri, parent)

Main method that will crawl the webpage.

  • uri - URI to be crawled
  • parent - URI of the page that contains link to uri

Example:

var pauk = new Pauk();
pauk.crawl('http://example.com');

onFinish

Called when crawling is finished. Parameter passed to this function is object containing the crawling data. Keys of this object are normalized URIs and values are objects containing:

{
    error: 'string', // (optional) in case of an error
    assets: {
        images: [], // values of src attributes of 'img' tags
        scripts: [], // values of src attributes of 'script' tags
        css: [], // values of href attributes of 'link' tags
        other: [] // values of href attributes of 'a' tags that are not URIs
    },
    links: [], // values of href attributes of 'a' tags (relative are converted to absolute)
    external: [] // values of href attributes of 'a' tags that are external links
}

Example:

var pauk = new Pauk();
pauk.onFinish = function(cache) {
    cache.forEach(function(v, uri) {
        if (v.error) console.log(uri + "\n" + v.error);
	    else console.log(uri + ": OK");
    });
}
pauk.crawl('http://example.com');

Keywords

FAQs

Package last updated on 27 May 2014

Did you know?

Socket

Socket for GitHub automatically highlights issues in each pull request and monitors the health of all your open source dependencies. Discover the contents of your packages and block harmful activity before you install or update your dependencies.

Install

Related posts

SocketSocket SOC 2 Logo

Product

  • Package Alerts
  • Integrations
  • Docs
  • Pricing
  • FAQ
  • Roadmap
  • Changelog

Packages

npm

Stay in touch

Get open source security insights delivered straight into your inbox.


  • Terms
  • Privacy
  • Security

Made with ⚡️ by Socket Inc