Skyscraper
Installation
Installing Skyscraper is simple. Just run:
gem install skyscraper
or add the following entry to your Gemfile:
gem "skyscraper"
if you want to use it in your Rails project.
Finding nodes by CSS Selectors
>> Skyscraper::fetch("http://rubyonrails.org").first("title").text
>> Skyscraper::fetch("http://rubyonrails.org").first(".copyright p").text
This works thanks to the Nokogiri#css method.
Reading HTML attributes
>> Skyscraper::fetch("http://rubyonrails.org").first(".announce").class
>> Skyscraper::fetch("http://rubyonrails.org").first("img").height
>> Skyscraper::fetch("http://rubyonrails.org").first(".copyright").style
Notice!
The Skyscraper::Node::Base#class method is overridden. To access the original class method, call Skyscraper::Node::Base#original_class.
You can find a list of all available methods in the Reading attributes section.
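The note above can be illustrated without Skyscraper itself. The following is a minimal sketch, assuming a hypothetical `SketchNode` class (not part of Skyscraper) that stores HTML attributes in a hash. It keeps Ruby's built-in `class` reachable under `original_class` before redefining `class` to return the HTML attribute. How Skyscraper::Node::Base does this internally is not shown here.

```ruby
# Hypothetical stand-in for a node; not Skyscraper's actual implementation.
class SketchNode
  # Keep Ruby's built-in Object#class reachable under another name
  # before it is redefined below.
  alias_method :original_class, :class

  def initialize(attributes)
    @attributes = attributes
  end

  # Return the HTML "class" attribute instead of the Ruby class.
  def class
    @attributes["class"]
  end
end

node = SketchNode.new("class" => "announce", "height" => "140")
puts node.class           # the HTML attribute value
puts node.original_class  # the Ruby class of the object
```

The alias has to be created before `class` is redefined, otherwise the new name would point at the overriding method.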
Using Skyscraper as included module
Fetching content from multiple pages and storing it in an Active Record database is a common problem. You can do this quickly by using Skyscraper as an included module.
class Sample
include Skyscraper
settings limit: 10, delay: { after: 5, time: 1 }, encoding: "utf-8"
pages ["http://google.com", "https://github.com", "http://rubyonrails.org"]
field :html, "html", :html
field :title, "title" do |node|
"'#{node.text}'"
end
field :first_link, "body" do |node|
"'#{node.first("a").href}'"
end
field :first_image, "img", :download
after_each do |result|
page = Page.new
page.title = result[:title]
page.html = result[:html]
page.first_link = result[:first_link]
page.first_image_path = result[:first_image]
page.save
end
after_all do
puts "Job done"
end
end
Sample.new.fetch
You will find more details in the Including section.
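For readers curious how class-level declarations such as `pages` and `field` can become available after a plain `include`, here is a minimal sketch of the general Ruby technique. Everything in it (`SketchScraper`, `SketchSample`, `describe`) is hypothetical and only illustrates the pattern. It is not Skyscraper's implementation and does not fetch anything.

```ruby
# Hypothetical mini-DSL; not Skyscraper's actual implementation.
module SketchScraper
  # Called by Ruby when the module is included; adds the class-level DSL.
  def self.included(base)
    base.extend(ClassMethods)
  end

  module ClassMethods
    # Store the list of pages when given one, and return the stored list.
    def pages(urls = nil)
      @pages = urls if urls
      @pages || []
    end

    def fields
      @fields ||= {}
    end

    # Register a field name with a selector and an optional block.
    def field(name, selector, &block)
      fields[name] = { selector: selector, block: block }
    end
  end

  # Instance-level entry point: here it only reports what was declared.
  def describe
    self.class.pages.map do |url|
      { url: url, fields: self.class.fields.keys }
    end
  end
end

class SketchSample
  include SketchScraper

  pages ["http://rubyonrails.org"]
  field :title, "title"
  field :first_link, "body" do |node|
    node.to_s.upcase
  end
end

p SketchSample.new.describe
```

The `included` hook is what lets a single `include` line add both instance methods and class-level declarations.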
Traversing
Traversing Skyscraper nodes is very similar to traversing the DOM with jQuery.
>> Skyscraper::fetch("https://github.com").first(".top-nav").find("li").map(&:html)
Of course, you can write the same code more simply:
>> Skyscraper::fetch("https://github.com").find(".top-nav li").map(&:html)
or even:
>> Skyscraper::fetch("https://github.com").find(".top-nav li a").map(&:content)
Read more about traversing in the Traversing section.
Following
You can quickly follow a node element if it has an href attribute:
>> Skyscraper::fetch("https://github.com").first(".top-nav li a").follow.first("title").html
This example visits the first menu item on the github.com page and then fetches the title of the linked page.
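Links in href attributes are often relative, so following one means resolving it against the URL of the page it came from. Below is a minimal sketch of that step using Ruby's standard `URI.join`. Whether Skyscraper resolves links this way internally is an assumption, and the `/features` value is a made-up href.

```ruby
require "uri"

page_url = "https://github.com/"
href     = "/features" # hypothetical href value read from a node

# Resolve the link against the page it was found on.
target = URI.join(page_url, href).to_s
puts target

# An absolute href is returned unchanged.
absolute = URI.join(page_url, "http://rubyonrails.org/").to_s
puts absolute
```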
Downloading
When a node element has a src or href attribute, you can easily download it:
>> Skyscraper::fetch("http://rubyonrails.org").first(".message img").download
You can provide the download path and a new file name as an argument. A default path can also be set in the configuration.
>> Skyscraper::fetch("http://rubyonrails.org").first(".message img").download(path: "/tmp/test/:sequence/:file_name")
>> Skyscraper::fetch("http://rubyonrails.org").first(".message img").download(path: "/tmp/test/my_file.png")
>> Skyscraper.config.download_path = "/tmp/test/my_path_from_config/:file_name"
>> Skyscraper::fetch("http://rubyonrails.org").first(".message img").download
The #download method returns the path to the saved file.
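The `:sequence` and `:file_name` placeholders in the paths above suggest a simple template expansion. The sketch below shows one way such a template could be filled in. `expand_download_path` is a hypothetical helper, not part of Skyscraper, the image URL is made up, and treating `:sequence` as a per-download counter is an assumption.

```ruby
require "uri"

# Hypothetical helper: fills :sequence and :file_name in a path template.
def expand_download_path(template, url, sequence)
  # Take the last path segment of the URL as the file name.
  file_name = File.basename(URI.parse(url).path)
  template
    .gsub(":sequence", sequence.to_s)
    .gsub(":file_name", file_name)
end

image_url = "http://rubyonrails.org/images/rails.png" # made-up URL
puts expand_download_path("/tmp/test/:sequence/:file_name", image_url, 1)

# A path without placeholders is left untouched.
puts expand_download_path("/tmp/test/my_file.png", image_url, 2)
```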
Configuration
Please visit the Configuration section for all the details of Skyscraper configuration.
Testing
Note that you can fetch not only remote sites but also local files. This can be very helpful if you practice TDD.
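One way to use this in a test is to write a small HTML fixture to disk and point Skyscraper at it instead of a live site. The sketch below only prepares the fixture with Ruby's standard `Tempfile`. The commented-out call assumes that `Skyscraper::fetch` accepts a plain file path, which should be checked against the gem before relying on it.

```ruby
require "tempfile"

# Write a small, predictable page to a temporary file.
fixture = Tempfile.new(["page", ".html"])
fixture.write("<html><head><title>Fixture</title></head><body></body></html>")
fixture.close

# With the gem installed, the local path could be used in place of a URL
# (assumed form of the call):
#   Skyscraper::fetch(fixture.path).first("title").text
puts fixture.path
```

Because the fixture never changes, assertions on its title or links stay stable regardless of what the remote site looks like.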
Other topics
Requirements
Skyscraper requires Ruby > 1.9. It also depends on the Nokogiri, Open-URI, URI and Actionpack libraries.
What is being considered for addition?
- POST requests support
- Reattempt fetching on errors
- Redirects support
- Testing mode - downloading only a small number of records and showing how they would look in the database
- Ruby < 1.9 versions support
- Redis, ActiveRecord cache and storage
- Ruby on Rails generators
Please don't hesitate to send me a comment about the above or any other functionality that might be added.
Contributors
Here I will post a list of contributors who help create documentation and bug fixes.