# Kabutops
![Coverage](https://codeclimate.com/github/reneklacan/kabutops/coverage.png)
Kabutops is a Ruby library which aims to simplify creating website crawlers.
You can define what will be crawled and how it will be saved in a short class definition.
With Kabutops you can easily save data to ElasticSearch 2.x.
Examples for every kind of database are located in the examples directory.
## Installation
You can install it via gem:

```sh
gem install kabutops
```

Or you can put it in your Gemfile:

```ruby
gem 'kabutops'
```

You will also need a Redis database installed and running.
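If Redis is not running yet, a default local installation can usually be started with the stock `redis-server` binary (a minimal sketch; adjust to however Redis is managed on your system):

```sh
# start Redis in the foreground on the default port 6379
redis-server
```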
## Basic example
Example that will crawl information about gems whose names start with the letter Q or
X and save them to ElasticSearch.
```ruby
require 'kabutops'

class GemListCrawler < Kabutops::Crawler
  collection ['Q', 'X'].map{ |letter|
    {
      letter: letter,
      url: "https://rubygems.org/gems?letter=#{letter}"
    }
  }
  cache true # cache downloaded pages
  wait 2     # wait 2 seconds between requests

  callbacks do
    after_crawl do |resource, page|
      # enqueue further list pages for the same letter
      links = page.xpath("//a[contains(@href, '/gems?letter=#{resource[:letter]}')]")
      links.each do |link|
        GemListCrawler << {
          letter: resource[:letter],
          url: "https://rubygems.org#{link['href']}",
        }
      end

      # enqueue every gem detail page for GemCrawler
      links = page.xpath("//a[contains(@href, '/gems/')]")
      links.each do |link|
        GemCrawler << {
          letter: resource[:letter],
          url: "https://rubygems.org#{link['href']}",
        }
      end
    end
  end
end

class GemCrawler < Kabutops::Crawler
  cache true
  wait 2

  elasticsearch do
    index :gems
    type :gem

    data do
      id :css, '.title > h2 > a'
      title :css, '.title > h2 > a'
      authors :css, '.authors > p'
      description :css, '#markup > p'
      downloads do
        total :lambda, ->(resource, page) {
          page.css('.downloads.counter > span > strong')[0].text.gsub(',', '').to_i
        }
        current_version :lambda, ->(resource, page) {
          page.css('.downloads.counter > span > strong')[1].text.gsub(',', '').to_i
        }
      end
    end

    callbacks do
      after_save do |hash|
        puts "#{hash[:title]} saved!"
      end
    end
  end
end

GemListCrawler.crawl!
GemCrawler.crawl!
```
Run it via Sidekiq:

```sh
# -r loads the crawler file, -c 1 limits Sidekiq to a single concurrent worker
bundle exec sidekiq -r ./rubygems_crawler.rb -c 1
```
Documents saved in ElasticSearch will look like this one:
```json
{
  "id": "qiita_mail",
  "title": "qiita_mail",
  "authors": "ongaeshi",
  "description": " Write a gem description",
  "downloads": {
    "total": 2493,
    "current_version": 580
  }
}
```
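To verify that documents actually reached the index, you can query ElasticSearch directly. This is only a sketch and assumes an ElasticSearch 2.x node on localhost:9200 and the `:gems` index used in the example above:

```sh
# URI search for the document saved above, pretty-printed
curl 'http://localhost:9200/gems/_search?q=title:qiita_mail&pretty'
```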
## Advanced
```ruby
class SomeCrawler < Kabutops::Crawler
  collection [
    {
      id: 'some_id',
      url: 'some_url.com/some_id',
    },
  ]
  agent ->{
    # build and return a custom agent here (Mechanize.new is the default)
  }
  proxy 'proxy_host.com', 1234 # proxy host and port
  wait 7                       # wait 7 seconds between requests
  skip_existing true           # skip resources that are already stored

  elasticsearch do
    host 'some_host.com'
    port 12345
    index :name_of_index
    type :type_of_es_doc

    data each: 'xpath if multiple records are located on one site' do
      attr1 :xpath, '//*[@class="bla"]', :int
      attr2 :css, '.bla', :float
    end

    callbacks do
      before_save do |result|
      end
      after_save do |result|
      end
      save_if do |resource, page, result|
      end
    end
  end

  callbacks do
    after_crawl do |resource, page|
    end
    before_cache do |resource, page|
    end
    store_if do
    end
  end
end
```
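Since `collection` is just an array of hashes, it does not have to be written by hand. As a purely hypothetical sketch (the file name, columns, and class name below are made up for illustration), it could be built from a CSV file of ids and URLs:

```ruby
require 'csv'
require 'kabutops'

# hypothetical urls.csv with "id,url" columns
RESOURCES = CSV.read('urls.csv', headers: true).map do |row|
  { id: row['id'], url: row['url'] }
end

class CsvCrawler < Kabutops::Crawler
  collection RESOURCES
  # ... elasticsearch block, callbacks, etc. as in the examples above
end
```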
## Debugging
As we all know, a crawler is rarely right on the first try.
Therefore there are methods for debugging:
```ruby
FruitCrawler.debug_first      # debug the first resource from the collection
FruitCrawler.debug_first 7    # debug the first 7 resources
FruitCrawler.debug_random     # debug a random resource
FruitCrawler.debug_random 3   # debug 3 random resources
FruitCrawler.debug_last       # debug the last resource
FruitCrawler.debug_last 5     # debug the last 5 resources
FruitCrawler.debug_all        # debug every resource
FruitCrawler.debug_resource({ id: '123', url: '...' }) # debug a single ad-hoc resource
```
These methods print out what would otherwise be saved to the
database, but nothing is actually written.
## Staying up to date
Note: this feature currently works only with ElasticSearch.

For this purpose there is a Watchdog. An updater has to inherit from
this class, and it can be run as a Sidekiq worker or as a
plain Ruby script, as you can see below.
```ruby
class GemUpdater < Kabutops::Watchdog
  crawler GemCrawler
  freshness 1*24*60*60 # consider documents older than one day outdated
  wait 5

  callbacks do
    on_outdated do |resource|
      puts "#{resource[:title]} outdated!"
      GemCrawler << {
        url: resource[:url],
      }
    end
  end
end

GemUpdater.loop
```
Run it as a plain Ruby script:

```sh
ruby rubygems_updater.rb
```
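Because the `on_outdated` callback only re-enqueues outdated resources into `GemCrawler`, you will usually also want a Sidekiq worker processing the crawler queue while the updater loop runs. One possible setup, assuming the file names used above:

```sh
# terminal 1: process the crawler queue
bundle exec sidekiq -r ./rubygems_crawler.rb -c 1

# terminal 2: run the updater loop
ruby rubygems_updater.rb
```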
## Anonymity à la Tor
Anonymity can be easily achieved with the Peasant gem.
By following this guide
you can create a proxy instance that will forward requests to
multiple Tor instances.
Then use the Peasant proxy address in your crawler class definition:
```ruby
class MyCrawler < Kabutops::Crawler
  ...
  proxy 'localhost', 81818
  ...
end
```
## JavaScript-heavy sites
Crawling these kinds of sites can be achieved by using a non-default agent
(the default is `Mechanize.new`).
```ruby
class MyCrawler < Kabutops::Crawler
  ...
  agent Bogeyman::Client.new
  ...
end
```
Bogeyman is a wrapper built upon PhantomJS.
## License
This library is distributed under the Beerware license.