scrapi 1.2.2 (RubyGems)

== ScrAPI toolkit for Ruby

A framework for writing scrapers using CSS selectors and simple select => extract => store processing rules.

Here’s an example that scrapes auctions from eBay:

  ebay_auction = Scraper.define do
    process "h3.ens>a", :description=>:text, :url=>"@href"
    process "td.ebcPr>span", :price=>:text
    process "div.ebPicture >a>img", :image=>"@src"

    result :description, :url, :price, :image
  end

  ebay = Scraper.define do
    array :auctions

    process "table.ebItemlist tr.single",
            :auctions => ebay_auction

    result :auctions
  end

And using the scraper:

  auctions = ebay.scrape(html)

  # No. of auctions found
  puts auctions.size

  # First auction:
  auction = auctions[0]
  puts auction.description
  puts auction.url
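
The accessors used above (auction.description, auction.url) suggest that each result behaves like a Ruby Struct with one member per name passed to result. The sketch below does not call scrAPI at all. It uses a plain Struct as a stand-in to show the shape of data that downstream code might work with. The Auction struct, the price_to_f helper, the sample values, and the assumption that unmatched fields come back as nil are all hypothetical.

```ruby
# Stand-in for the objects a scraper declared with
#   result :description, :url, :price, :image
# might return. Auction and the sample values are hypothetical.
Auction = Struct.new(:description, :url, :price, :image)

auctions = [
  Auction.new("Vintage camera", "http://example.com/item/1", "$25.00", "http://example.com/1.jpg"),
  Auction.new("Film scanner", "http://example.com/item/2", "$80.00", nil)
]

# Assumption: a field whose selector matched nothing is nil, so filter before use.
with_images = auctions.select { |a| a.image }

# Hypothetical helper: strip everything except digits and the decimal point
# from the extracted price text, then convert it to a Float.
def price_to_f(text)
  text.to_s.delete("^0-9.").to_f
end

total = auctions.sum { |a| price_to_f(a.price) }
puts "#{auctions.size} auctions, #{with_images.size} with images, total #{format('%.2f', total)}"
```

Since :price is extracted as text, any numeric handling like this would have to happen after scraping.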

To get the latest source code with regular updates:

  svn co http://labnotes.org/svn/public/ruby/scrapi

== Using TIDY

By default, scrAPI uses Tidy to clean up the HTML.

You need to install the Tidy gem for Ruby:

  gem install tidy

And the Tidy binary libraries, available here:

http://tidy.sourceforge.net/

By default, scrAPI looks for the Tidy DLL (Windows) or shared library (Linux) in the directory lib/tidy, so that is one place to put the Tidy library.

Alternatively, point Tidy to the library with:

  Tidy.path = "...."

On Linux this would probably be:

  Tidy.path = "/usr/local/lib/libtidy.so"

On OS X this would probably be:

  Tidy.path = "/usr/lib/libtidy.dylib"
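
The per-platform paths above can be chosen at runtime. The following is a minimal sketch, assuming the two default locations mentioned in this section. TIDY_CANDIDATES and tidy_path_for are hypothetical names and are not part of scrAPI or the Tidy gem. The code only assigns Tidy.path when the Tidy constant is already loaded and the file actually exists.

```ruby
require "rbconfig"

# Default library locations taken from the text above; adjust for your system.
TIDY_CANDIDATES = {
  /linux/  => "/usr/local/lib/libtidy.so",
  /darwin/ => "/usr/lib/libtidy.dylib"
}.freeze

# Hypothetical helper: returns the suggested Tidy library path for a
# host OS string, or nil when no default is known (for example on Windows).
def tidy_path_for(host_os = RbConfig::CONFIG["host_os"])
  match = TIDY_CANDIDATES.find { |pattern, _path| host_os =~ pattern }
  match && match.last
end

path = tidy_path_for
if path && File.exist?(path)
  # Only meaningful once the Tidy gem has been required.
  Tidy.path = path if defined?(Tidy)
  puts "Using Tidy library at #{path}"
else
  puts "No Tidy library found at a default location; set Tidy.path manually"
end
```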

For testing purposes, you can also use the built-in HTML parser. It's useful for testing and getting to grips with scrAPI, but it doesn't deal well with broken HTML. So for testing only:

  Scraper::Base.parser :html_parser

== License

Copyright (c) 2006 Assaf Arkin, under Creative Commons Attribution and/or MIT License

Developed for http://co.mments.com

Code and documentation: http://labnotes.org

HTML cleanup and good hygiene by Tidy, Copyright (c) 1998-2003 World Wide Web Consortium. License at http://tidy.sourceforge.net/license.html

HTML DOM extracted from Rails, Copyright (c) 2004 David Heinemeier Hansson. Under MIT license.

HTML parser by Takahiro Maebashi and Katsuyuki Komatsu, Ruby license. http://www.jin.gr.jp/~nahi/Ruby/html-parser/README.html

Package last updated on 19 Oct 2010