microcrawler

Micro implementation of crawler

Version: 0.1.30 (latest)
Source: npm
Weekly downloads: 21 (-72%)
Maintainers: 1

Available Official Crawlers

List of official publicly available crawlers.

Missing something? Feel free to open an issue.

Prerequisites

Installation

From npmjs.org (the easy way)

This is the easiest way. The prerequisites still need to be satisfied.

npm install -g microcrawler

From Sources

This is useful if you want to tweak the source code, implement a new crawler, etc.

# Clone repository
git clone https://github.com/ApolloCrawler/microcrawler.git

# Enter folder
cd microcrawler

# Install required packages - dependencies
npm install

# Install from local sources
npm install -g .

Usage

Show available commands

$ microcrawler

  Usage: microcrawler [options] [command]


  Commands:

    collector [args]  Run data collector
    config [args]     Run config
    exporter [args]   Run data exporter
    worker [args]     Run crawler worker
    crawl [args]      Crawl specified site
    help [cmd]        display help for [cmd]

  Options:

    -h, --help     output usage information
    -V, --version  output the version number

Check microcrawler version

$ microcrawler --version
0.1.27

Initialize config file

$ microcrawler config init
2016-09-03T10:45:13.105Z - info: Creating config file "/Users/tomaskorcak/.microcrawler/config.json"
{
    "client": "superagent",
    "timeout": 10000,
    "throttler": {
        "enabled": false,
        "active": true,
        "rate": 20,
        "ratePer": 1000,
        "concurrent": 8
    },
    "retry": {
        "count": 2
    },
    "headers": {
        "Accept": "*/*",
        "User-Agent": "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)",
        "From": "googlebot(at)googlebot.com"
    },
    "proxy": {
        "enabled": false,
        "list": [
            "https://168.63.20.19:8145"
        ]
    },
    "natFaker": {
        "enabled": true,
        "base": "192.168.1.1",
        "bits": 16
    },
    "amqp": {
        "uri": "amqp://localhost",
        "queues": {
            "collector": "collector",
            "worker": "worker"
        },
        "options": {
            "heartbeat": 60
        }
    },
    "couchbase": {
        "uri": "couchbase://localhost:8091",
        "bucket": "microcrawler",
        "username": "Administrator",
        "password": "Administrator",
        "connectionTimeout": 60000000,
        "durabilityTimeout": 60000000,
        "managementTimeout": 60000000,
        "nodeConnectionTimeout": 10000000,
        "operationTimeout": 10000000,
        "viewTimeout": 10000000
    },
    "elasticsearch": {
        "uri": "localhost:9200",
        "index": "microcrawler",
        "log": "debug"
    }
}

Edit config file

$ vim ~/.microcrawler/config.json
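
Hand-editing JSON makes it easy to leave a stray comma or quote behind, so it can help to back the file up first and check that it still parses afterwards. The following is a minimal sketch, not part of microcrawler: `backup_config` and `validate_config` are hypothetical helper names, the default path is the one shown in the `config init` log output above, and the validation step assumes `node` is on the PATH (which the npm-based install implies).

```shell
#!/bin/sh
# Hypothetical helpers; not part of microcrawler itself.
# Default path taken from the `config init` log output.
CONFIG_FILE="${CONFIG_FILE:-$HOME/.microcrawler/config.json}"

# Copy the config to a timestamped backup next to it and print the backup path.
backup_config() {
  src="${1:-$CONFIG_FILE}"
  [ -f "$src" ] || { echo "no config file at $src" >&2; return 1; }
  dst="$src.$(date +%Y%m%d%H%M%S).bak"
  cp "$src" "$dst" && echo "$dst"
}

# Return 0 if the file parses as JSON (assumes `node` is available).
validate_config() {
  src="${1:-$CONFIG_FILE}"
  node -e 'JSON.parse(require("fs").readFileSync(process.argv[1], "utf8"))' "$src"
}
```

A typical edit session would then be `backup_config && vim ~/.microcrawler/config.json && validate_config`.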

Show config file

$ microcrawler config show
{
    "client": "superagent",
    "timeout": 10000,
    "throttler": {
        "enabled": false,
        "active": true,
        "rate": 20,
        "ratePer": 1000,
        "concurrent": 8
    },
    "retry": {
        "count": 2
    },
    "headers": {
        "Accept": "*/*",
        "User-Agent": "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)",
        "From": "googlebot(at)googlebot.com"
    },
    "proxy": {
        "enabled": false,
        "list": [
            "https://168.63.20.19:8145"
        ]
    },
    "natFaker": {
        "enabled": true,
        "base": "192.168.1.1",
        "bits": 16
    },
    "amqp": {
        "uri": "amqp://example.com",
        "queues": {
            "collector": "collector",
            "worker": "worker"
        },
        "options": {
            "heartbeat": 60
        }
    },
    "couchbase": {
        "uri": "couchbase://example.com:8091",
        "bucket": "microcrawler",
        "username": "Administrator",
        "password": "Administrator",
        "connectionTimeout": 60000000,
        "durabilityTimeout": 60000000,
        "managementTimeout": 60000000,
        "nodeConnectionTimeout": 10000000,
        "operationTimeout": 10000000,
        "viewTimeout": 10000000
    },
    "elasticsearch": {
        "uri": "example.com:9200",
        "index": "microcrawler",
        "log": "debug"
    }
}

Start Couchbase

TBD

Start Elasticsearch

TBD

Start Kibana

TBD

Query elasticsearch

TBD

Example usage

Craigslist

microcrawler crawl craiglist.index http://sfbay.craigslist.org/sfc/sss/

Firmy.cz

microcrawler crawl firmy.cz.index "https://www.firmy.cz?_escaped_fragment_="

Google

microcrawler crawl google.index "http://google.com/search?q=Buena+Vista"

Hacker News

microcrawler crawl hackernews.index https://news.ycombinator.com/

xkcd

microcrawler crawl xkcd.index http://xkcd.com

Yelp

microcrawler crawl yelp.index "http://www.yelp.com/search?find_desc=restaurants&find_loc=Los+Angeles%2C+CA&ns=1&ls=f4de31e623458437"

Youjizz

microcrawler crawl youjizz.com.index http://youjizz.com
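
The commands above each crawl a single site. To run several of them one after another, a small wrapper can read crawler/URL pairs from a list. This is only a sketch under stated assumptions: `crawl_all` and the `MICROCRAWLER_BIN` variable are hypothetical names introduced here (they are not microcrawler features), and it assumes each `crawl` invocation exits once it has finished.

```shell
#!/bin/sh
# Hypothetical helper; not part of microcrawler itself.
# Reads "<crawler> <url>" pairs from stdin and runs one crawl per line.
# MICROCRAWLER_BIN is an assumption of this sketch (not a microcrawler
# option): set it to `echo` to print the commands instead of running them.
crawl_all() {
  bin="${MICROCRAWLER_BIN:-microcrawler}"
  while read -r crawler url; do
    # Skip blank lines and lines starting with '#'
    case "$crawler" in ''|\#*) continue ;; esac
    # </dev/null keeps the child process from consuming the remaining list
    "$bin" crawl "$crawler" "$url" </dev/null ||
      echo "crawl failed: $crawler $url" >&2
  done
}

# Dry run: prints the two commands that would be executed
MICROCRAWLER_BIN=echo crawl_all <<'EOF'
xkcd.index http://xkcd.com
hackernews.index https://news.ycombinator.com/
EOF
```

Dropping `MICROCRAWLER_BIN=echo` runs the real crawls. Because the URLs are read from a list rather than typed on the command line, characters such as `?` and `&` need no extra quoting there.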

Credits

Package last updated on 03 Sep 2016
