# microcrawler
## Status

## Screenshots
## Available Official Crawlers

A list of official, publicly available crawlers.

Missing something? Feel free to open an issue.
## Prerequisites
## Installation

### From npmjs.org (the easy way)

This is the easiest way to install microcrawler. The prerequisites still need to be satisfied.

```bash
npm install -g microcrawler
```
### From Sources

This is useful if you want to tweak the source code, implement a new crawler, etc.

```bash
# Clone the repository
git clone https://github.com/ApolloCrawler/microcrawler.git

# Enter the folder
cd microcrawler

# Install required packages (dependencies)
npm install

# Install globally from the local sources
npm install -g .
```
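Conceptually, a crawler is a processor that takes a fetched page and returns extracted records plus follow-up URLs to enqueue. The sketch below illustrates that idea only — the function name, its signature, and the returned shape are hypothetical, not microcrawler's actual module format:

```javascript
// Conceptual sketch only: a crawler "processor" takes a fetched page and
// returns extracted data plus follow-up URLs to enqueue. The names and
// shapes here are hypothetical, NOT microcrawler's actual module API.

function processIndexPage(html, baseUrl) {
  const items = [];
  const followUrls = [];

  // Naive link extraction for illustration; a real processor would use a
  // proper HTML parser (e.g. cheerio) rather than a regular expression.
  const linkRe = /<a\s+href="([^"]+)">([^<]*)<\/a>/g;
  let m;
  while ((m = linkRe.exec(html)) !== null) {
    const url = new URL(m[1], baseUrl).toString(); // resolve relative links
    items.push({ url, title: m[2] });
    followUrls.push(url);
  }

  return { items, followUrls };
}
```

The crawl loop then feeds each `followUrls` entry back into the fetch queue until nothing new is discovered.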
## Usage

### Show available commands

```bash
$ microcrawler
Usage: microcrawler [options] [command]

Commands:

  collector [args]  Run data collector
  config [args]     Run config
  exporter [args]   Run data exporter
  worker [args]     Run crawler worker
  crawl [args]      Crawl specified site
  help [cmd]        display help for [cmd]

Options:

  -h, --help     output usage information
  -V, --version  output the version number
```
### Check microcrawler version

```bash
$ microcrawler --version
0.1.27
```
### Initialize config file

```bash
$ microcrawler config init
2016-09-03T10:45:13.105Z - info: Creating config file "/Users/tomaskorcak/.microcrawler/config.json"
```

```json
{
  "client": "superagent",
  "timeout": 10000,
  "throttler": {
    "enabled": false,
    "active": true,
    "rate": 20,
    "ratePer": 1000,
    "concurrent": 8
  },
  "retry": {
    "count": 2
  },
  "headers": {
    "Accept": "*/*",
    "User-Agent": "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)",
    "From": "googlebot(at)googlebot.com"
  },
  "proxy": {
    "enabled": false,
    "list": [
      "https://168.63.20.19:8145"
    ]
  },
  "natFaker": {
    "enabled": true,
    "base": "192.168.1.1",
    "bits": 16
  },
  "amqp": {
    "uri": "amqp://localhost",
    "queues": {
      "collector": "collector",
      "worker": "worker"
    },
    "options": {
      "heartbeat": 60
    }
  },
  "couchbase": {
    "uri": "couchbase://localhost:8091",
    "bucket": "microcrawler",
    "username": "Administrator",
    "password": "Administrator",
    "connectionTimeout": 60000000,
    "durabilityTimeout": 60000000,
    "managementTimeout": 60000000,
    "nodeConnectionTimeout": 10000000,
    "operationTimeout": 10000000,
    "viewTimeout": 10000000
  },
  "elasticsearch": {
    "uri": "localhost:9200",
    "index": "microcrawler",
    "log": "debug"
  }
}
```
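The `throttler` block limits request pressure on crawled sites: at most `rate` requests per `ratePer` milliseconds, with at most `concurrent` requests in flight. The following is a rough sketch of that behavior only — the function names are hypothetical and this is not microcrawler's actual implementation:

```javascript
// Illustration only: a minimal request throttler honoring the config's
// `rate` / `ratePer` / `concurrent` semantics. Hypothetical names, not
// microcrawler's internals.

function makeThrottler({ rate, ratePer, concurrent }) {
  let inFlight = 0;        // requests currently running
  let windowStart = Date.now();
  let windowCount = 0;     // requests started in the current rate window
  const queue = [];

  function pump() {
    const now = Date.now();
    if (now - windowStart >= ratePer) {   // start a new rate window
      windowStart = now;
      windowCount = 0;
    }
    while (queue.length > 0 && inFlight < concurrent && windowCount < rate) {
      const { task, resolve, reject } = queue.shift();
      inFlight += 1;
      windowCount += 1;
      Promise.resolve()
        .then(task)
        .then(resolve, reject)
        .finally(() => { inFlight -= 1; pump(); });
    }
    if (queue.length > 0) {
      // Try again when the current rate window rolls over.
      setTimeout(pump, ratePer - (now - windowStart));
    }
  }

  // Returns a promise that resolves once `task` has been run.
  return task => new Promise((resolve, reject) => {
    queue.push({ task, resolve, reject });
    pump();
  });
}
```

With `"enabled": false`, as in the default config above, requests would bypass this limiter entirely.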
### Edit config file

```bash
$ vim ~/.microcrawler/config.json
```
### Show config file

```bash
$ microcrawler config show
```

```json
{
  "client": "superagent",
  "timeout": 10000,
  "throttler": {
    "enabled": false,
    "active": true,
    "rate": 20,
    "ratePer": 1000,
    "concurrent": 8
  },
  "retry": {
    "count": 2
  },
  "headers": {
    "Accept": "*/*",
    "User-Agent": "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)",
    "From": "googlebot(at)googlebot.com"
  },
  "proxy": {
    "enabled": false,
    "list": [
      "https://168.63.20.19:8145"
    ]
  },
  "natFaker": {
    "enabled": true,
    "base": "192.168.1.1",
    "bits": 16
  },
  "amqp": {
    "uri": "amqp://example.com",
    "queues": {
      "collector": "collector",
      "worker": "worker"
    },
    "options": {
      "heartbeat": 60
    }
  },
  "couchbase": {
    "uri": "couchbase://example.com:8091",
    "bucket": "microcrawler",
    "username": "Administrator",
    "password": "Administrator",
    "connectionTimeout": 60000000,
    "durabilityTimeout": 60000000,
    "managementTimeout": 60000000,
    "nodeConnectionTimeout": 10000000,
    "operationTimeout": 10000000,
    "viewTimeout": 10000000
  },
  "elasticsearch": {
    "uri": "example.com:9200",
    "index": "microcrawler",
    "log": "debug"
  }
}
```
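The `natFaker` block describes faking client addresses behind a NAT: with `base` `192.168.1.1` and `bits` 16, addresses are drawn from the `192.168.x.x` range. A minimal sketch of that idea, assuming only the base-plus-variable-bits interpretation (the helper names are hypothetical, not microcrawler's internals):

```javascript
// Illustration only: derive fake client IPs from a base address and a
// number of variable low-order bits, as the `natFaker` config suggests.

function ipToInt(ip) {
  // "192.168.1.1" -> 32-bit unsigned integer
  return ip.split('.').reduce((acc, oct) => (acc << 8) + Number(oct), 0) >>> 0;
}

function intToIp(n) {
  // 32-bit unsigned integer -> dotted-quad string
  return [24, 16, 8, 0].map(s => (n >>> s) & 0xff).join('.');
}

// Pick a random address inside the block defined by `base` and `bits`,
// e.g. base "192.168.1.1" with 16 variable bits => 192.168.x.x.
function fakeIp(base, bits) {
  const prefix = ipToInt(base) & (~0 << bits); // keep the fixed high bits
  const host = Math.floor(Math.random() * 2 ** bits);
  return intToIp((prefix | host) >>> 0);
}
```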
### Start Couchbase

TBD

### Start Elasticsearch

TBD

### Start Kibana

TBD

### Query Elasticsearch

TBD
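Until this section is written, a typical search against the `microcrawler` index from the config above might use a request body like the following (illustrative only; the `url` field name is an assumption about the stored documents):

```json
{
  "query": {
    "match": {
      "url": "craigslist"
    }
  },
  "size": 10
}
```

sent, for example, to `GET localhost:9200/microcrawler/_search`.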
## Example usage

### Craigslist

```bash
microcrawler crawl craiglist.index http://sfbay.craigslist.org/sfc/sss/
```

### Firmy.cz

```bash
microcrawler crawl firmy.cz.index "https://www.firmy.cz?_escaped_fragment_="
```

### Google

```bash
microcrawler crawl google.index "http://google.com/search?q=Buena+Vista"
```

### Hacker News

```bash
microcrawler crawl hackernews.index https://news.ycombinator.com/
```

### xkcd

```bash
microcrawler crawl xkcd.index http://xkcd.com
```

### Yelp

```bash
microcrawler crawl yelp.index "http://www.yelp.com/search?find_desc=restaurants&find_loc=Los+Angeles%2C+CA&ns=1&ls=f4de31e623458437"
```

### Youjizz

```bash
microcrawler crawl youjizz.com.index http://youjizz.com
```
## Credits