Open source module for crawling web pages and extracting data. This module
is extremely extensible and pluggable; feel free to play with it. [Visit
crawlerjs.org for more info][0].
Examples
In this example we will search Stack Overflow for the most recent
questions about Node.js; other examples can be found here.
var Crawler = require('crawler-js');

var manifest = {
  target: {
    url: 'https://stackoverflow.com/questions/tagged/node.js',
    headers: ['User-Agent', 'CrawlerJS']
  },
  extractors: [{
    name: 'questions',
    root: '//div[@class="question-summary"]',
    fields: {
      question: '//a[1]/text()',
      link: '//a[1]/@href'
    }
  }, {
    name: 'count',
    fields: {
      count: '//*[@id="sidebar"]/div[1]/div[1]/text()'
    }
  }]
};

var job = new Crawler(manifest);

job.on('data', function(data) {
  console.log('Data extracted for %s:', data.name);
  console.log(data.data);
});

job.on('error', function(err) {
  throw err;
});

job.on('end', function() {
  console.log('The job is done');
});

job.start();
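
Because the crawler emits a separate 'data' event for each extractor, a handy pattern
is to collect everything in memory and process it once the 'end' event fires. The
sketch below builds on the example above (it reuses the same manifest object);
writing the results to a results.json file is just an illustrative choice, not part
of the module's API.

var fs = require('fs');
var Crawler = require('crawler-js');

// `manifest` is the same object defined in the example above.
var job = new Crawler(manifest);
var results = {};

// Each 'data' event carries the output of one extractor ('questions' or 'count').
job.on('data', function(data) {
  results[data.name] = data.data;
});

job.on('error', function(err) {
  console.error('Crawl failed:', err);
});

// Once the job is done, persist everything collected so far.
job.on('end', function() {
  fs.writeFileSync('results.json', JSON.stringify(results, null, 2));
  console.log('Results written to results.json');
});

job.start();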
Testing
Testing is easy: you need to have grunt-cli installed globally. Clone this
repository, run npm install inside the folder, and then run grunt test.
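
For reference, the commands described above look roughly like this (assuming a
standard npm/grunt setup; the repository URL is omitted here):

npm install -g grunt-cli   # one-time global install of the Grunt CLI
npm install                # run inside the cloned repository folder
grunt test                 # run the test suite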
Todos
- Add inline comments inside the code.
- Proper documentation.
- API reference.
- More tests.
License
Copyright (c) 2014, Rodrigo Matheus <rodrigorizando@gmail.com>
Permission to use, copy, modify, and/or distribute this software for any purpose
with or without fee is hereby granted, provided that the above copyright notice
and this permission notice appear in all copies.
THE SOFTWARE IS PROVIDED "AS IS" AND THE AUTHOR DISCLAIMS ALL WARRANTIES WITH
REGARD TO THIS SOFTWARE INCLUDING ALL IMPLIED WARRANTIES OF MERCHANTABILITY AND
FITNESS. IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY SPECIAL, DIRECT,
INDIRECT, OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES WHATSOEVER RESULTING FROM LOSS
OF USE, DATA OR PROFITS, WHETHER IN AN ACTION OF CONTRACT, NEGLIGENCE OR OTHER
TORTIOUS ACTION, ARISING OUT OF OR IN CONNECTION WITH THE USE OR PERFORMANCE OF
THIS SOFTWARE.