
== ScrAPI toolkit for Ruby
A framework for writing scrapers using CSS selectors and simple select => extract => store processing rules.
Here’s an example that scrapes auctions from eBay:
  ebay_auction = Scraper.define do
    process "h3.ens>a", :description=>:text, :url=>"@href"
    process "td.ebcPr>span", :price=>:text
    process "div.ebPicture >a>img", :image=>"@src"

    result :description, :url, :price, :image
  end

  ebay = Scraper.define do
    array :auctions

    process "table.ebItemlist tr.single", :auctions => ebay_auction

    result :auctions
  end
And using the scraper:
  auctions = ebay.scrape(html)
  puts auctions.size
  auction = auctions[0]
  puts auction.description
  puts auction.url
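The scraped auctions can then be handled like any other Ruby collection. The following is a minimal sketch, assuming each auction behaves like a Struct exposing the attributes named in result and that an unmatched attribute is nil. The Auction struct, the sample values and the summarize helper are hypothetical stand-ins so the snippet runs without scrAPI or a live page:

```ruby
# Hypothetical stand-in for the objects returned by ebay.scrape(html).
# Assumption: each auction exposes the attributes listed in `result`.
Auction = Struct.new(:description, :url, :price, :image)

# Made-up sample data; real values would come from the scraped page.
auctions = [
  Auction.new("Vintage camera", "http://example.com/1", "$25.00", "http://example.com/1.jpg"),
  Auction.new("Old radio", "http://example.com/2", nil, nil)
]

# Build one line of text per auction, tolerating a missing price.
def summarize(auction)
  price = auction.price || "price unknown"
  "#{auction.description} (#{price}) #{auction.url}"
end

lines = auctions.map { |auction| summarize(auction) }
puts lines
```

Because :text extracts a string, the price is kept as text here. Converting it to a number is left to the caller.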
To get the latest source code with regular updates:
  svn co http://labnotes.org/svn/public/ruby/scrapi
== Using TIDY
By default scrAPI uses Tidy to clean up the HTML.
You need to install the Tidy gem for Ruby:
  gem install tidy
And the Tidy binary libraries, available here:
By default scrAPI looks for the Tidy DLL (Windows) or shared library (Linux) in the directory lib/tidy, so that is one place to put the Tidy library.
Alternatively, just point Tidy to the library with:
  Tidy.path = "...."
On Linux this would probably be:
  Tidy.path = "/usr/local/lib/libtidy.so"
On OS X this would probably be:
  Tidy.path = "/usr/lib/libtidy.dylib"
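If the same script runs on more than one platform, the path can be chosen at run time. This is a sketch that assumes the example locations above. tidy_library_path is a hypothetical helper, not part of scrAPI or the Tidy gem, and the real location depends on how Tidy was installed:

```ruby
require "rbconfig"

# Hypothetical helper: maps a host OS name to the example library
# locations given above. Returns nil for platforms it does not know.
def tidy_library_path(host_os = RbConfig::CONFIG["host_os"])
  case host_os
  when /darwin/ then "/usr/lib/libtidy.dylib"
  when /linux/  then "/usr/local/lib/libtidy.so"
  end
end

path = tidy_library_path
# Only configure Tidy when the gem is loaded and the file really exists.
if defined?(Tidy) && path && File.exist?(path)
  Tidy.path = path
end
```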
For testing purposes, you can also use the built-in HTML parser. It's useful for testing and getting to grips with scrAPI, but it doesn't deal well with broken HTML. So for testing only:
  Scraper::Base.parser :html_parser
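One way to keep this switch out of production code is to decide from the environment. This is a sketch under assumptions: SCRAPI_ENV is a made-up environment variable, parser_for is a hypothetical helper, and :tidy is assumed to be the name of the default parser:

```ruby
# Hypothetical helper: choose a parser name from an environment label.
# Assumption: :tidy names scrAPI's default parser; :html_parser is the
# built-in parser shown above.
def parser_for(env)
  env.to_s == "test" ? :html_parser : :tidy
end

parser = parser_for(ENV["SCRAPI_ENV"]) # SCRAPI_ENV is a made-up variable
# Apply the choice only when scrAPI has been loaded.
Scraper::Base.parser(parser) if defined?(Scraper::Base)
```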
== License
Copyright (c) 2006 Assaf Arkin, under Creative Commons Attribution and/or MIT License
Developed for http://co.mments.com
Code and documentation: http://labnotes.org
HTML cleanup and good hygiene by Tidy, Copyright (c) 1998-2003 World Wide Web Consortium. License at http://tidy.sourceforge.net/license.html
HTML DOM extracted from Rails, Copyright (c) 2004 David Heinemeier Hansson. Under MIT license.
HTML parser by Takahiro Maebashi and Katsuyuki Komatsu, Ruby license. http://www.jin.gr.jp/~nahi/Ruby/html-parser/README.html