# microcrawler
## Status

## Screenshots
## Available Official Crawlers

A list of official, publicly available crawlers.

Missing something? Feel free to open an issue.
## Prerequisites
## Installation

### From npmjs.org (the easy way)

This is the easiest way to install microcrawler. The prerequisites still need to be satisfied.

```bash
npm install -g microcrawler
```
### From Sources

This is useful if you want to tweak the source code, implement a new crawler, etc.

```bash
# Clone the repository
git clone https://github.com/ApolloCrawler/microcrawler.git

# Enter the folder
cd microcrawler

# Install required packages (dependencies)
npm install

# Install globally from the local sources
npm install -g .
```
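Conceptually, a crawler is a processor that takes a fetched page and returns extracted records plus follow-up URLs to enqueue. The sketch below illustrates that idea only — the function name, its signature, and the returned shape are hypothetical, not microcrawler's actual module format:

```javascript
// Conceptual sketch only: a crawler "processor" takes a fetched page and
// returns extracted data plus follow-up URLs to enqueue. The names and
// shapes here are hypothetical, NOT microcrawler's actual module API.

function processIndexPage(html, baseUrl) {
  const items = [];
  const followUrls = [];

  // Naive link extraction for illustration; a real processor would use a
  // proper HTML parser (e.g. cheerio) rather than a regular expression.
  const linkRe = /<a\s+href="([^"]+)">([^<]*)<\/a>/g;
  let m;
  while ((m = linkRe.exec(html)) !== null) {
    const url = new URL(m[1], baseUrl).toString(); // resolve relative links
    items.push({ url, title: m[2] });
    followUrls.push(url);
  }

  return { items, followUrls };
}
```

The crawl loop then feeds each `followUrls` entry back into the fetch queue until nothing new is discovered.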
## Usage

### Show available commands

```bash
$ microcrawler
Usage: microcrawler [options] [command]

Commands:

  collector [args]  Run data collector
  config [args]     Run config
  exporter [args]   Run data exporter
  worker [args]     Run crawler worker
  crawl [args]      Crawl specified site
  help [cmd]        display help for [cmd]

Options:

  -h, --help     output usage information
  -V, --version  output the version number
```
### Check microcrawler version

```bash
$ microcrawler --version
0.1.27
```
### Initialize config file

```bash
$ microcrawler config init
2016-09-03T10:45:13.105Z - info: Creating config file "/Users/tomaskorcak/.microcrawler/config.json"
```

```json
{
  "client": "superagent",
  "timeout": 10000,
  "throttler": {
    "enabled": false,
    "active": true,
    "rate": 20,
    "ratePer": 1000,
    "concurrent": 8
  },
  "retry": {
    "count": 2
  },
  "headers": {
    "Accept": "*/*",
    "User-Agent": "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)",
    "From": "googlebot(at)googlebot.com"
  },
  "proxy": {
    "enabled": false,
    "list": [
      "https://168.63.20.19:8145"
    ]
  },
  "natFaker": {
    "enabled": true,
    "base": "192.168.1.1",
    "bits": 16
  },
  "amqp": {
    "uri": "amqp://localhost",
    "queues": {
      "collector": "collector",
      "worker": "worker"
    },
    "options": {
      "heartbeat": 60
    }
  },
  "couchbase": {
    "uri": "couchbase://localhost:8091",
    "bucket": "microcrawler",
    "username": "Administrator",
    "password": "Administrator",
    "connectionTimeout": 60000000,
    "durabilityTimeout": 60000000,
    "managementTimeout": 60000000,
    "nodeConnectionTimeout": 10000000,
    "operationTimeout": 10000000,
    "viewTimeout": 10000000
  },
  "elasticsearch": {
    "uri": "localhost:9200",
    "index": "microcrawler",
    "log": "debug"
  }
}
```
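The `throttler` block limits request pressure on crawled sites: at most `rate` requests per `ratePer` milliseconds, with at most `concurrent` requests in flight. The following is a rough sketch of that behavior only — the function names are hypothetical and this is not microcrawler's actual implementation:

```javascript
// Illustration only: a minimal request throttler honoring the config's
// `rate` / `ratePer` / `concurrent` semantics. Hypothetical names, not
// microcrawler's internals.

function makeThrottler({ rate, ratePer, concurrent }) {
  let inFlight = 0;        // requests currently running
  let windowStart = Date.now();
  let windowCount = 0;     // requests started in the current rate window
  const queue = [];

  function pump() {
    const now = Date.now();
    if (now - windowStart >= ratePer) {   // start a new rate window
      windowStart = now;
      windowCount = 0;
    }
    while (queue.length > 0 && inFlight < concurrent && windowCount < rate) {
      const { task, resolve, reject } = queue.shift();
      inFlight += 1;
      windowCount += 1;
      Promise.resolve()
        .then(task)
        .then(resolve, reject)
        .finally(() => { inFlight -= 1; pump(); });
    }
    if (queue.length > 0) {
      // Try again when the current rate window rolls over.
      setTimeout(pump, ratePer - (now - windowStart));
    }
  }

  // Returns a promise that resolves once `task` has been run.
  return task => new Promise((resolve, reject) => {
    queue.push({ task, resolve, reject });
    pump();
  });
}
```

With `"enabled": false`, as in the default config above, requests would bypass this limiter entirely.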
### Edit config file

```bash
$ vim ~/.microcrawler/config.json
```
### Show config file

```bash
$ microcrawler config show
```

```json
{
  "client": "superagent",
  "timeout": 10000,
  "throttler": {
    "enabled": false,
    "active": true,
    "rate": 20,
    "ratePer": 1000,
    "concurrent": 8
  },
  "retry": {
    "count": 2
  },
  "headers": {
    "Accept": "*/*",
    "User-Agent": "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)",
    "From": "googlebot(at)googlebot.com"
  },
  "proxy": {
    "enabled": false,
    "list": [
      "https://168.63.20.19:8145"
    ]
  },
  "natFaker": {
    "enabled": true,
    "base": "192.168.1.1",
    "bits": 16
  },
  "amqp": {
    "uri": "amqp://example.com",
    "queues": {
      "collector": "collector",
      "worker": "worker"
    },
    "options": {
      "heartbeat": 60
    }
  },
  "couchbase": {
    "uri": "couchbase://example.com:8091",
    "bucket": "microcrawler",
    "username": "Administrator",
    "password": "Administrator",
    "connectionTimeout": 60000000,
    "durabilityTimeout": 60000000,
    "managementTimeout": 60000000,
    "nodeConnectionTimeout": 10000000,
    "operationTimeout": 10000000,
    "viewTimeout": 10000000
  },
  "elasticsearch": {
    "uri": "example.com:9200",
    "index": "microcrawler",
    "log": "debug"
  }
}
```
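The `natFaker` block describes faking client addresses behind a NAT: with `base` `192.168.1.1` and `bits` 16, addresses are drawn from the `192.168.x.x` range. A minimal sketch of that idea, assuming only the base-plus-variable-bits interpretation (the helper names are hypothetical, not microcrawler's internals):

```javascript
// Illustration only: derive fake client IPs from a base address and a
// number of variable low-order bits, as the `natFaker` config suggests.

function ipToInt(ip) {
  // "192.168.1.1" -> 32-bit unsigned integer
  return ip.split('.').reduce((acc, oct) => (acc << 8) + Number(oct), 0) >>> 0;
}

function intToIp(n) {
  // 32-bit unsigned integer -> dotted-quad string
  return [24, 16, 8, 0].map(s => (n >>> s) & 0xff).join('.');
}

// Pick a random address inside the block defined by `base` and `bits`,
// e.g. base "192.168.1.1" with 16 variable bits => 192.168.x.x.
function fakeIp(base, bits) {
  const prefix = ipToInt(base) & (~0 << bits); // keep the fixed high bits
  const host = Math.floor(Math.random() * 2 ** bits);
  return intToIp((prefix | host) >>> 0);
}
```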
### Start Couchbase

TBD

### Start Elasticsearch

TBD

### Start Kibana

TBD

### Query Elasticsearch

TBD
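Until this section is written, a typical search against the `microcrawler` index from the config above might use a request body like the following (illustrative only; the `url` field name is an assumption about the stored documents):

```json
{
  "query": {
    "match": {
      "url": "craigslist"
    }
  },
  "size": 10
}
```

sent, for example, to `GET localhost:9200/microcrawler/_search`.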
## Example usage

### Craigslist

```bash
microcrawler crawl craiglist.index http://sfbay.craigslist.org/sfc/sss/
```

### Firmy.cz

```bash
microcrawler crawl firmy.cz.index "https://www.firmy.cz?_escaped_fragment_="
```

### Google

```bash
microcrawler crawl google.index "http://google.com/search?q=Buena+Vista"
```

### Hacker News

```bash
microcrawler crawl hackernews.index https://news.ycombinator.com/
```

### xkcd

```bash
microcrawler crawl xkcd.index http://xkcd.com
```

### Yelp

```bash
microcrawler crawl yelp.index "http://www.yelp.com/search?find_desc=restaurants&find_loc=Los+Angeles%2C+CA&ns=1&ls=f4de31e623458437"
```

### Youjizz

```bash
microcrawler crawl youjizz.com.index http://youjizz.com
```
## Credits