Huge News!Announcing our $40M Series B led by Abstract Ventures.Learn More
Socket
Sign inDemoInstall
Socket

scrape-it

Package Overview
Dependencies
Maintainers
1
Versions
48
Alerts
File Explorer

Advanced tools

Socket logo

Install Socket

Detect and block malicious and high-risk dependencies

Install

scrape-it

A Node.js scraper for humans.

  • 3.3.0
  • Source
  • npm
  • Socket score

Version published
Weekly downloads
3.8K
decreased by-10.17%
Maintainers
1
Weekly downloads
 
Created
Source

scrape-it

scrape-it

Patreon PayPal AMA Travis Version Downloads Get help on Codementor

A Node.js scraper for humans.

:cloud: Installation

$ npm i --save scrape-it

:clipboard: Example

const scrapeIt = require("scrape-it");

// Promise interface
scrapeIt("http://ionicabizau.net", {
    title: ".header h1"
  , desc: ".header h2"
  , avatar: {
        selector: ".header img"
      , attr: "src"
    }
}).then(page => {
    console.log(page);
});

// Callback interface
scrapeIt("http://ionicabizau.net", {
    // Fetch the articles
    articles: {
        listItem: ".article"
      , data: {

            // Get the article date and convert it into a Date object
            createdAt: {
                selector: ".date"
              , convert: x => new Date(x)
            }

            // Get the title
          , title: "a.article-title"

            // Nested list
          , tags: {
                listItem: ".tags > span"
            }

            // Get the content
          , content: {
                selector: ".article-content"
              , how: "html"
            }
        }
    }

    // Fetch the blog pages
  , pages: {
        listItem: "li.page"
      , name: "pages"
      , data: {
            title: "a"
          , url: {
                selector: "a"
              , attr: "href"
            }
        }
    }

    // Fetch some other data from the page
  , title: ".header h1"
  , desc: ".header h2"
  , avatar: {
        selector: ".header img"
      , attr: "src"
    }
}, (err, page) => {
    console.log(err || page);
});
// { articles:
//    [ { createdAt: Mon Mar 14 2016 00:00:00 GMT+0200 (EET),
//        title: 'Pi Day, Raspberry Pi and Command Line',
//        tags: [Object],
//        content: '<p>Everyone knows (or should know)...a" alt=""></p>\n' },
//      { createdAt: Thu Feb 18 2016 00:00:00 GMT+0200 (EET),
//        title: 'How I ported Memory Blocks to modern web',
//        tags: [Object],
//        content: '<p>Playing computer games is a lot of fun. ...' },
//      { createdAt: Mon Nov 02 2015 00:00:00 GMT+0200 (EET),
//        title: 'How to convert JSON to Markdown using json2md',
//        tags: [Object],
//        content: '<p>I love and ...' } ],
//   pages:
//    [ { title: 'Blog', url: '/' },
//      { title: 'About', url: '/about' },
//      { title: 'FAQ', url: '/faq' },
//      { title: 'Training', url: '/training' },
//      { title: 'Contact', url: '/contact' } ],
//   title: 'Ionică Bizău',
//   desc: 'Web Developer,  Linux geek and  Musician',
//   avatar: '/images/logo.png' }

:memo: Documentation

scrapeIt(url, opts, cb)

A scraping module for humans.

Params
  • String|Object url: The page url or request options.
  • Object opts: The options passed to scrapeHTML method.
  • Function cb: The callback function.
Return
  • Promise A promise object.

scrapeIt.scrapeHTML($, opts)

Scrapes the data in the provided element.

Params
  • Cheerio $: The input element.

  • Object opts: An object containing the scraping information. If you want to scrape a list, you have to use the listItem selector:

    • listItem (String): The list item selector.
    • data (Object): The fields to include in the list objects:
      • <fieldName> (Object|String): The selector or an object containing:
        • selector (String): The selector.
        • convert (Function): An optional function to change the value.
        • how (Function|String): A function or function name to access the value.
        • attr (String): If provided, the value will be taken based on the attribute name.
        • trim (Boolean): If false, the value will not be trimmed (default: true).
        • closest (String): If provided, returns the first ancestor of the given element.
        • eq (Number): If provided, it will select the nth element.
        • listItem (Object): An object, keeping the recursive schema of the listItem object. This can be used to create nested lists.

    Example:

    {
       articles: {
           listItem: ".article"
         , data: {
               createdAt: {
                   selector: ".date"
                 , convert: x => new Date(x)
               }
             , title: "a.article-title"
             , tags: {
                   listItem: ".tags > span"
               }
             , content: {
                   selector: ".article-content"
                 , how: "html"
               }
             , traverseOtherNode: {
                   selector: ".upperNode"
                 , closest: "div"
                 , convert: x => x.length
               }
           }
       }
    }
    

    If you want to collect specific data from the page, just use the same schema used for the data field.

    Example:

    {
         title: ".header h1"
       , desc: ".header h2"
       , avatar: {
             selector: ".header img"
           , attr: "src"
         }
    }
    
Return
  • Object The scraped data.

:yum: How to contribute

Have an idea? Found a bug? See how to contribute.

:moneybag: Donations

Another way to support the development of my open-source modules is to set up a recurring donation, via Patreon. :rocket:

PayPal donations are appreciated too! Each dollar helps.

Thanks! :heart:

:dizzy: Where is this library used?

If you are using this library in one of your projects, add it in this list. :sparkles:

  • 3abn—A 3ABN radio client in the terminal.
  • bandcamp-scraper (by Simon Thiboutôt)—A scraper for https://bandcamp.com
  • cevo-lookup (by Zack Boehm)—Searchs the CEVO Suspension List for bans by SteamID
  • codementor—A scraper for codementor.io.
  • degusta-scrapper (by yohendry hurtado)—desgusta scrapper for alexa skill
  • proxylist (by self_refactor)—Get free proxy list
  • rs-api (by Alex Kempf)—Simple wrapper for RuneScape APIs written in node.
  • sahibinden (by Cagatay Cali)—Simple sahibinden.com bot
  • sahibindenServer (by Cagatay Cali)—Simple sahibinden.com bot server side
  • sgdq-collector (by Benjamin Congdon)—Collects Twitch / Donation information and pushes data to Firebase
  • trump-cabinet-picks (by Linda Haviv)—NYT cabinet predictions for Trump admin.
  • ubersetzung (by self_refactor)—translate words with examples from German to English
  • ui-studentsearch (by Rakha Kanz Kautsar)—API for majapahit.cs.ui.ac.id/studentsearch

:scroll: License

MIT © Ionică Bizău

Keywords

FAQs

Package last updated on 28 Feb 2017

Did you know?

Socket

Socket for GitHub automatically highlights issues in each pull request and monitors the health of all your open source dependencies. Discover the contents of your packages and block harmful activity before you install or update your dependencies.

Install

Related posts

SocketSocket SOC 2 Logo

Product

  • Package Alerts
  • Integrations
  • Docs
  • Pricing
  • FAQ
  • Roadmap
  • Changelog

Packages

npm

Stay in touch

Get open source security insights delivered straight into your inbox.


  • Terms
  • Privacy
  • Security

Made with ⚡️ by Socket Inc