Socket
Socket
Sign inDemoInstall

cheers2

Package Overview
Dependencies
20
Maintainers
1
Versions
3
Alerts
File Explorer

Advanced tools

Install Socket

Detect and block malicious and high-risk dependencies

Install

    cheers2

Scrape a website efficiently, block by block, page by page. Based on cheerio and cURL.


Version published
Maintainers
1
Install size
2.33 MB
Created

Readme

Source

Cheers

Scrape a website efficiently, block by block, page by page.

Motivations

This is a Cheerio based scraper, useful to extract data from a website using CSS selectors.
The motivation behind this package is to provide a simple cheerio-based scraping tool, able to divide a website into blocks, and transform each block into a JSON object using CSS selectors.

Built on top of the excellents :

https://github.com/cheeriojs/cheerio
https://github.com/chriso/curlrequest
https://github.com/kriskowal/q

CSS mapping syntax inspired by :

https://github.com/dharmafly/noodle

Getting Started

Install the module with: npm install cheers

Usage

Configuration options:

  • config.url : the URL to scrape
  • config.blockSelector : the CSS selector to apply on the page to divide it in scraping blocks. This field is optional (will use "body" by default)
  • config.scrape : the definition of what you want to extract in each block. Each key has two mandatory attributes : selector (a CSS selector or . to stay on the current node) and extract. The possible values for extract are text, html, outerHTML, a RegExp or the name of an attribute of the html element (e.g. "href")
var cheers = require('cheers');

//let's scrape this excellent JS news website
var config = {
    url: "http://www.echojs.com/",
    blockSelector: "article",
    scrape: {
        title: {
            selector: "h2 a",
            extract: "text"
        },
        link: {
            selector: "h2 a",
            extract: "href"
        },
        articleInnerHtml: {
            selector: ".",
            extract: "html"
        },
        articleOuterHtml: {
            selector: ".",
            extract: "outerHTML"
        },
        articlePublishedTime: {
            selector: 'p',
            extract: /\d* (?:hour[s]?|day[s]?) ago/
        }
    }
};

cheers.scrape(config).then(function (results) {
    console.log(JSON.stringify(results));
}).catch(function (error) {
    console.error(error);
});

Roadmap

  • Option to use request instead of curl
  • Option to change the user agent
  • Command line tool
  • Website pagination
  • Option to use a headless browser
  • Unit tests

Contributors

Cheers!

License

Copyright (c) 2014 Fabien Allanic
Licensed under the MIT license.

Keywords

FAQs

Last updated on 08 Jan 2015

Did you know?

Socket for GitHub automatically highlights issues in each pull request and monitors the health of all your open source dependencies. Discover the contents of your packages and block harmful activity before you install or update your dependencies.

Install

Related posts

SocketSocket SOC 2 Logo

Product

  • Package Alerts
  • Integrations
  • Docs
  • Pricing
  • FAQ
  • Roadmap

Stay in touch

Get open source security insights delivered straight into your inbox.


  • Terms
  • Privacy
  • Security

Made with ⚡️ by Socket Inc