Socket
Book a DemoInstallSign in
Socket

json-web-crawler

Package Overview
Dependencies
Maintainers
1
Versions
20
Alerts
File Explorer

Advanced tools

Socket logo

Install Socket

Detect and block malicious and high-risk dependencies

Install

json-web-crawler

Crawl website by json

latest
Source
npmnpm
Version
0.8.3
Version published
Maintainers
1
Created
Source

Json Web Crawler

NPM version Node version Open Source Love

Use JSON to list all elements (with css 3 and jquery selector) that you want to crawl.

Demo

Usage

npm i json-web-crawler --save
const crawl = require('json-web-crawler');

crawl('HTML content', your json setting)
  .then(console.log)
  .catch(console.error);

Settings

type

The default type is content.

  • content: crawl specific $container to a single json.
  • list: crawl a list like Google search result into multi data.

container

DOM element that will focus on. If type is list, it will crawl each container class.

listOption

Optional, enable in list type only, use when you don't want to crawl the whole list. ** ALL STRAT FROM 0 **

  • [ 'limit', 10 ]: ten elements only (eq(0) ~ eq(9)).
  • [ 'range', 6, 12 ]: from eq(6) to eq(12 - 1). If without end, it will continue to the last one.
  • [ 'focus', 0, 3, 7, ... ]: specific elements in list (eq(0), eq(3), eq(7), ...). You can use -1, -2 to count from backward.
  • [ 'ignore', 1, 2, 5 ]: elements you want to ignore it. You can use -1, -2 to count from backward.

crawl

keyName: { options } => keyName: data

crawl: {
  image: {
    elem: 'img',
    get: 'src'
  }
}

// will become
image: IMAGE_SRC_URL

options

  • elem: element inside container. If empty or undefined, it will use container or listElems instead
  • noChild (boolean): remove all children elem under $(elem)
  • outOfContainer (boolean): If exist, It will use $('html').find(elem)
  • get: return type of element
    • text
    • num
    • length: $element.length
    • attrName: $element.attr('attrName')
    • data-dataName: $element.data('dataNAme')
    • data-dataName:X: X is optional.
      • If data is an array, set data-dataName:0 will return $elem.data('dataAttribute')[0].
      • If data is an object, set data-dataName:id will return $elem.data('dataAttribute')['id'].
      • If X not exist, it will return the whole data.
  • process: If you want to do something else after 'get' (string type only)
// You can use some simple functions that existed in lodash.
process: [
  ['match', /regex here/, number],  // => str.match(/regex here/)[number], return array if no number, but will cause other process won't work
  ['split', ',', number],           // => str.split(',')[number], return array if no number, but will cause other process won't work
  ['replace', 'one', 'two'],
  ['substring', 0, 3],
  ['prepend', 'text'],              // => 'text' + value
  ['append', 'text'],               // => value + 'text'
  ['indexOf', 'text']               // => return number
  ['INDENPENDENT_FUNCTION'],        // like encodeURI, encodeURIComponent, unescape, etc...
  /**
    * Due to lodash has the same name `escape` & `unescape` functions with
    * different behavior, the origin `escape` & `unescape` function will
    * renamed to `encode` & `decode` instead.
    */
],

// Or you want to DIY, you can use function instead
process(value, $elem /* jquery dom */) {
  // do something

  return newValue;
}
  • collect: If the value you want is sperated to several elements, use collect to find them all.

    • elems: contain multi elements array.
    • loop (boolean): It will run all elems (like li) you want to get
    • combineWith: without this, collect will return array
  • default: return default value when elem not found, null or undefined (process will be ignored)

pageNotFound

If match, it will return page not found error.

  • elem
  • get
  • check: like process, but only one step

Example

Content Type

Steam Dota2 page in demo.

const setting = {
  type: 'content',
  container: '#game_highlights .rightcol',
  crawl: {
    appId: {
      elem: '.glance_tags',
      get:  'data-appid'
    },
    appName: {
      outOfContainer: true,
      elem: '.apphub_AppName',
      get:  'text'
    },
    image: {
      elem: '.game_header_image_full',
      get:  'src'
    },
    reviews: {
      elem: '.game_review_summary:eq(0)',
      get:  'text',
    },
    tags: {
      elem: '.glance_tags',
      collect: {
        elems: [{
          elem: 'a.app_tag:eq(0)',
          get:  'text'
        }, {
          elem: 'a.app_tag:eq(1)',
          get:  'text'
        }, {
          elem: 'a.app_tag:eq(2)',
          get:  'text'
        }],
        combineWith: ', '
      }
    },
    allTags: {
      elem: '.glance_tags a.app_tag',
      collect: {
        loop: true,
        get:  'text',
        combineWith: ', '
      }
    },
    description: {
      elem: '.game_description_snippet',
      get:  'text',
      process(value, $elem) {
        return value.split(', ');
      }
    },
    releaseDate: {
      elem: '.release_date .date',
      get:  'text'
    }
  }
};

List Type

KickStarter popular list in demo.

const setting = {
  pageNotFound: [{
    elem: '.grey-frame-inner h1',
    get:  'text',
    check: ['equal', '404']
  }],
  type: 'list',
  container: '#projects_list .project-card',
  listOption: [ 'limit', 3 ],
  // listOption: [ 'range', 0, 10 ],
  // listOption: [ 'ignore', 0, 2, -1 ],
  // listOption: [ 'focus', 3, -3 ],
  crawl: {
    projectID: {
      get: 'data-pid',
    },
    name: {
      elem: '.project-title',
      get:  'text',
    },
    image: {
      elem: '.project-thumbnail img',
      get:  'src'
    },
    link: {
      elem: '.project-title a',
      get:  'href',
      process: [
        [ 'split', '?', 0 ],
        [ 'prepend', 'https://www.kickstarter.com' ]
      ]
    },
    description: {
      elem: '.project-blurb',
      get:  'text'
    },
    funded: {
      elem: '.project-stats-value:eq(0)',
      get:  'text'
    },
    percentPledged: {
      elem: '.project-percent-pledged',
      get:  'style',
      process: [
        [ 'split', /:\s?/g, 1 ]
      ]
    },
    pledged: {
      elem: '.money.usd',
      get:  'num'
    }
  }
};

Keywords

json

FAQs

Package last updated on 20 Sep 2022

Did you know?

Socket

Socket for GitHub automatically highlights issues in each pull request and monitors the health of all your open source dependencies. Discover the contents of your packages and block harmful activity before you install or update your dependencies.

Install

Related posts

SocketSocket SOC 2 Logo

Product

About

Packages

Stay in touch

Get open source security insights delivered straight into your inbox.

  • Terms
  • Privacy
  • Security

Made with ⚡️ by Socket Inc

U.S. Patent No. 12,346,443 & 12,314,394. Other pending.