scrape-it

Dependencies

Maintainers

Versions

Alerts

File Explorer

Advanced tools

License

Install Socket

Detect and block malicious and high-risk dependencies

Install

scrape-it

A Node.js scraper for humans.

1.0.0
Source
npm

Version published: 8 years ago

Weekly downloads: 0

Maintainers: 1

Weekly downloads

Created: 8 years ago

Source

scrape-it

A Node.js scraper for humans.

:cloud: Installation

$ npm i --save scrape-it

:clipboard: Example

const scrapeIt = require("scrape-it");

scrapeIt("http://ionicabizau.net", [
    // Fetch the articles on the page (list)
    {
        listItem: ".article"
      , name: "articles"
      , data: {
            createdAt: {
                selector: ".date"
              , convert: x => new Date(x)
            }
          , title: "a.article-title"
          , tags: {
                selector: ".tags"
              , convert: x => x.split("|").map(c => c.trim()).slice(1)
            }
          , content: {
                selector: ".article-content"
              , how: "html"
            }
        }
    }
  , {
        listItem: "li.page"
      , name: "pages"
      , data: {
            title: "a"
          , url: {
                selector: "a"
              , attr: "href"
            }
        }
    }
    // Fetch some additional data
  , {
        title: ".header h1"
      , desc: ".header h2"
      , avatar: {
            selector: ".header img"
          , attr: "src"
        }
  }
], (err, page) => {
    console.log(err || page);
});
// { articles:
//    [ { createdAt: Mon Mar 14 2016 00:00:00 GMT+0200 (EET),
//        title: 'Pi Day, Raspberry Pi and Command Line',
//        tags: [Object],
//        content: '<p>Everyone knows (or should know)...a" alt=""></p>\n' },
//      { createdAt: Thu Feb 18 2016 00:00:00 GMT+0200 (EET),
//        title: 'How I ported Memory Blocks to modern web',
//        tags: [Object],
//        content: '<p>Playing computer games is a lot of fun. ...' },
//      { createdAt: Mon Nov 02 2015 00:00:00 GMT+0200 (EET),
//        title: 'How to convert JSON to Markdown using json2md',
//        tags: [Object],
//        content: '<p>I love and ...' } ],
//   pages:
//    [ { title: 'Blog', url: '/' },
//      { title: 'About', url: '/about' },
//      { title: 'FAQ', url: '/faq' },
//      { title: 'Training', url: '/training' },
//      { title: 'Contact', url: '/contact' } ],
//   title: 'Ionică Bizău',
//   desc: 'Web Developer,  Linux geek and  Musician',
//   avatar: '/images/logo.png' }

:memo: Documentation

`scrapeIt(url, opts, cb)`

A scraping module for humans.

Params

String|Object url: The page url or request options.
Object|Array opts: The options passed to scrapeCheerio method.
Function cb: The callback function.

Return

Tinyreq The request object.

`scrapeIt.scrapeCheerio($input, opts, $)`

Scrapes the data in the provided element.

Params

Cheerio $input: The input element.
Object opts: An array or object containing the scraping information. If you want to scrape a list, you have to use the listItem selector:
- listItem (String): The list item selector.
- name (String): The list name (e.g. articles).
- data (Object): The fields to include in the list objects:
  - <fieldName> (Object|String): The selector or an object containing:
    - selector (String): The selector.
    - convert (Function): An optional function to change the value.
    - how (Function|String): A function or function name to access the value.
    - attr (String): If provided, the value will be taken based on the attribute name.
    - trim (Boolean): If false, the value will not be trimmed (default: true).
    - eq (Number): If provided, it will select the nth element.
    - listItem (Object): An object, keeping the recursive schema of the listItem object. This can be used to create nested lists.
Example:
```
{
    listItem: ".article"
  , name: "articles"
  , data: {
        createdAt: {
            selector: ".date"
          , convert: x => new Date(x)
        }
      , title: "a.article-title"
      , tags: {
            selector: ".tags"
          , convert: x => x.split("|").map(c => c.trim()).slice(1)
        }
      , content: {
            selector: ".article-content"
          , how: "html"
        }
    }
}
```
If you want to collect specific data from the page, just use the same schema used for the data field.

Example:
```
{
     title: ".header h1"
   , desc: ".header h2"
   , avatar: {
         selector: ".header img"
       , attr: "src"
     }
}
```
Function $: The Cheerio function.

Return

Object The scrapped data.

:yum: How to contribute

Have an idea? Found a bug? See how to contribute.

:scroll: License

Keywords

FAQs

What is scrape-it?

Is scrape-it popular?

Is scrape-it well maintained?

Package last updated on 28 Apr 2016

Did you know?

Socket for GitHub automatically highlights issues in each pull request and monitors the health of all your open source dependencies. Discover the contents of your packages and block harmful activity before you install or update your dependencies.

Install

scrape-it

scrape-it

:cloud: Installation

:clipboard: Example

:memo: Documentation

scrapeIt(url, opts, cb)

Params

Return

scrapeIt.scrapeCheerio($input, opts, $)

Params

Return

:yum: How to contribute

:scroll: License

Keywords

Related posts

Highlights from the 2024 Rails Community Survey

Combatting Alert Fatigue by Prioritizing Malicious Intent

`scrapeIt(url, opts, cb)`

`scrapeIt.scrapeCheerio($input, opts, $)`