
This gem implements a MongoDB back-end for Spidey, a very simple framework for crawling and scraping web sites.
See Spidey's documentation for a basic example spider class.
The default implementation stores the queue of URLs being crawled, any generated results, and errors as attributes on the spider instance (i.e., in memory). By including this gem's module, spider implementations can store them in a MongoDB database instead.
gem install spidey-mongo
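Or, if you manage dependencies with Bundler, add the gem to your application's Gemfile and run bundle install (this is standard Bundler usage rather than anything specific to this gem):

gem 'spidey-mongo'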
mongo versus moped
Spidey-Mongo provides three strategies:
Spidey::Strategies::Mongo: compatible with the mongo Ruby driver 1.x
Spidey::Strategies::Mongo2: compatible with the mongo Ruby driver 2.x, e.g., for use with Mongoid 5.x
Spidey::Strategies::Moped: compatible with moped 2.x, e.g., for use with Mongoid 3.x and 4.x

You can include whichever strategy is appropriate in your spider classes; switching is just a matter of which module you include, as in the sketch below.
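For instance, a minimal sketch of a spider that opts into the 2.x driver strategy (the class name is a placeholder; only the include line differs):

class ExampleSpider < Spidey::AbstractSpider
  include Spidey::Strategies::Mongo2  # pairs with the 2.x mongo driver, e.g., Mongoid 5.x

  # handlers are declared and defined exactly as with the other strategies
end

All of the examples in this README assume Spidey::Strategies::Mongo: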
class EbaySpider < Spidey::AbstractSpider
  include Spidey::Strategies::Mongo

  handle "http://www.ebay.com", :process_home

  def process_home(page, default_data = {})
    # ...
  end
end
The spider's constructor accepts new parameters for each of the MongoDB collections to employ: url_collection, result_collection, and error_collection.
db = Mongo::Connection.new['example']
spider = EbaySpider.new(
  url_collection: db['urls'],
  result_collection: db['results'],
  error_collection: db['errors'])
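If you're on the 2.x driver strategy (Spidey::Strategies::Mongo2), the connection goes through Mongo::Client rather than Mongo::Connection. A sketch, assuming the collections are passed to the constructor in exactly the same way (host and database name are placeholders):

client = Mongo::Client.new(['127.0.0.1:27017'], database: 'example')  # 2.x mongo driver
spider = EbaySpider.new(
  url_collection: client[:urls],
  result_collection: client[:results],
  error_collection: client[:errors])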
With persistent storage of the URL-crawling queue, it's now possible to stop crawling and resume at a later point. The crawl method accepts a new optional crawl_for parameter specifying the number of seconds after which to stop.
spider.crawl crawl_for: 600  # seconds (or, with ActiveSupport, crawl_for: 10.minutes)
(The base implementation's max_urls parameter is also useful for this purpose.)
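For example, a hedged sketch of bounding a crawl by both time and URL count, assuming crawl accepts the two options together:

# stop after 10 minutes of crawling or 500 URLs, whichever comes first
spider.crawl crawl_for: 600, max_urls: 500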
By default, invocations of record(data) by the spider simply insert new documents into the result collection. If corresponding results may already exist in the collection and should instead be updated, define a result_key method that returns a key by which to find the corresponding document. The method is called with a hash of the data being recorded:
class EbaySpider < Spidey::AbstractSpider
  include Spidey::Strategies::Mongo

  def result_key(data)
    data[:detail_url]
  end

  # ...
end
This performs an upsert instead of the usual insert (i.e., an update if a result document matching the key already exists, or an insert otherwise).
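To make this concrete, here is a hypothetical detail-page handler (the handler name and page fields are illustrative, not part of this gem); because result_key returns data[:detail_url], recording the same detail URL twice updates a single document instead of creating a duplicate:

def process_detail(page, default_data = {})
  # page is the fetched Mechanize page; detail_url acts as the upsert key
  record default_data.merge(
    detail_url: page.uri.to_s,
    title: page.at('h1')&.text
  )
end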
Please contribute! See CONTRIBUTING for details.
Copyright (c) 2012-2015 Joey Aghion, Artsy Inc., and Contributors.
See LICENSE.txt for further details.