A Ruby library for extracting structured data from websites and web-based APIs. It supports the most common document formats (e.g. HTML, XML, CSV, and JSON) and comes with a handy mechanism for iterating over paginated datasets.
gem install extraloop
A basic scraper that fetches the top 25 websites from Alexa's daily top 100 list:
alexa_scraper = ExtraLoop::ScraperBase.
new("http://www.alexa.com/topsites").
loop_on("li.site-listing").
extract(:site_name, "h2").
extract(:url, "h2 a").
extract(:description, ".description").
on(:data) { |data| data.each { |record| puts record.site_name } }
alexa_scraper.run
An iterative scraper that fetches the URL, title, and publisher of about 110 Google News articles mentioning the keyword 'Egypt'.
results = []
ExtraLoop::IterativeScraper.
new("https://www.google.com/search?tbm=nws&q=Egypt").
set_iteration(:start, (1..101).step(10)).
loop_on("h3") { |nodes| nodes.map(&:parent) }.
extract(:title, "h3.r a").
extract(:url, "h3.r a", :href).
extract(:source, "br") { |node| node.next.text.split("-").first }.
on(:data) { |data, response| data.each { |record| results << record } }.
run
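As a quick sanity check on the range used in set_iteration above, (1..101).step(10) enumerates the start offsets Google News is queried with:

```ruby
# The iteration values passed to set_iteration(:start, ...) above:
offsets = (1..101).step(10).to_a
offsets       # => [1, 11, 21, 31, 41, 51, 61, 71, 81, 91, 101]
offsets.size  # 11 pages of 10 results each, hence the ~110 articles
```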
Scrapers are initialized with #new(urls, scraper_options, http_options). The http_options hash is passed on to Typhoeus::Request#initialize (see the API documentation for details), while scraper_options can be used to tune logging, e.g. setting the log level to :info or directing output to Logging.appenders.stderr.
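The scraper_options and http_options arguments to #new are plain hashes. A sketch of how they might be assembled follows; the option keys below are assumptions, so check the extraloop API docs for the keys your version actually supports:

```ruby
# Hypothetical option keys -- verify against the extraloop API docs.
scraper_options = {
  :log_level => :info
  # :appenders => [Logging.appenders.stderr]  # requires the logging gem
}
http_options = {
  :headers => { "User-Agent" => "extraloop-example/0.1" }
}

# These would then be passed as the second and third initializer arguments:
# ExtraLoop::ScraperBase.new(urls, scraper_options, http_options)
```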
ExtraLoop lets you fetch structured data from online documents by looping through a list of elements matching a given selector.
For each matched element, an arbitrary set of fields can be extracted. The loop_on method sets up such a loop, while the extract method pulls a specific piece of information out of an element (e.g. a story's title) and stores it in a record's field.
# looping over a set of document elements using a CSS3 (or XPath) selector
loop_on('div.post')
# looping using a block (the parsed document is passed in as its first argument)
loop_on { |doc| doc.search('div.post') }
# using both a selector and a block (the matched element list is passed to the block as its first argument)
loop_on('div.post') { |posts| posts.reject { |post| post.attr(:class) == 'sticky' } }
Both the loop_on and the extract methods may be called with a selector, a block, or a combination of the two. By default, when parsing DOM documents, extract will call Nokogiri::XML::Node#text(). Alternatively, extract also accepts an attribute name and a block; the latter is evaluated in the context of the current iteration's element.
# extract a story's title
extract(:title, 'h3')
# extract a story's url
extract(:url, "a.link-to-story", :href)
# extract a description text, separating paragraphs with newlines
extract(:description, "div.description") { |node| node.css("p").map(&:text).join("\n") }
While processing an HTTP response, ExtraLoop tries to detect the scraped document's format automatically by looking at the Content-Type header sent by the server. This value can be overridden by providing a :format key in the scraper's initialization options. When the format is JSON, the document is parsed using the yajl JSON parser and converted into a hash.
In this case, both the loop_on and the extract methods still behave as illustrated above, except that they do not support CSS3/XPath selectors.
When working with JSON data, you can just use a block and have it return the document elements you want to loop on.
# Fetch a portion of a document using a proc
loop_on { |data| data['query']['categorymembers'] }
Alternatively, the same loop can be defined by passing an array of keys pointing at a hash value located at several levels of depth down into the parsed document structure.
# Same as above, using a hash path
loop_on(['query', 'categorymembers'])
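A hash path like the one above can be sketched in plain Ruby; the assumption here is that ExtraLoop resolves the keys with successive hash lookups, returning nil as soon as a key is missing:

```ruby
# dig_path is a hypothetical helper, not part of the extraloop API;
# it illustrates what loop_on(['query', 'categorymembers']) amounts to.
def dig_path(document, keys)
  keys.reduce(document) { |node, key| node && node[key] }
end

doc = { 'query' => { 'categorymembers' => [{ 'title' => 'Ruby' }] } }
dig_path(doc, ['query', 'categorymembers'])
# => [{"title"=>"Ruby"}]
```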
When fetching fields from a JSON document fragment, extract will often need neither a block nor an array of keys: called with only one argument, it will simply try to fetch a hash value using the provided field name as its key.
# current node:
#
# {
# 'from_user' => "johndoe",
# 'text' => 'bla bla bla',
# 'from_user_id'..
# }
# >> extract(:from_user)
# => "johndoe"
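The comment above can be turned into a runnable sketch, under the assumption that a single-argument extract reduces to a plain hash lookup with the field name converted to a string key:

```ruby
# Stand-in for the current JSON node in the iteration.
node = {
  'from_user' => 'johndoe',
  'text'      => 'bla bla bla'
}

# Under this assumption, extract(:from_user) amounts to:
value = node[:from_user.to_s]
value  # => "johndoe"
```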
The IterativeScraper class comes with two methods that allow scrapers to loop over paginated content. The first, #set_iteration, accepts an iteration parameter and the set of values to iterate over (as in the Google News example above). The second, #continue_with, lets an iteration continue for as long as a block of code returns a truthy value (which is then assigned to the iteration parameter).
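A minimal plain-Ruby sketch of the #continue_with control flow, assuming each block return becomes the next value of the iteration parameter and a nil return stops the loop (the array below stands in for values the block would parse out of each response):

```ruby
next_pages = [2, 3, nil]  # stand-in for values parsed from each response
visited = []
page = 1
while page
  visited << page          # here the scraper would fetch and parse the page
  page = next_pages.shift  # the block's return value: next page, or nil to stop
end
visited  # => [1, 2, 3]
```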
ExtraLoop uses rspec and rr for testing. The test suite can be run by calling the rspec executable from within the spec directory:
cd spec
rspec *