Spidey-Mongo


This gem implements a MongoDB back-end for Spidey, a very simple framework for crawling and scraping web sites.

See Spidey's documentation for a basic example spider class.

The default implementation stores the queue of URLs being crawled, any generated results, and errors as attributes on the spider instance (i.e., in memory). By including this gem's module, spider implementations can store them in a MongoDB database instead.

Usage

Install the gem

gem install spidey-mongo
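
If you manage dependencies with Bundler, you can instead add the gem to your Gemfile and run bundle install (a minimal sketch; no particular version constraint is assumed):

gem 'spidey-mongo'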

mongo versus moped

Spidey-Mongo provides three strategies:

  • Spidey::Strategies::Mongo: Compatible with the mongo gem 1.x (Mongo Ruby Driver 1.x)
  • Spidey::Strategies::Mongo2: Compatible with the mongo gem 2.x (Mongo Ruby Driver 2.x), e.g., for use with Mongoid 5.x
  • Spidey::Strategies::Moped: Compatible with the moped gem 2.x, e.g., for use with Mongoid 3.x and 4.x

Include whichever strategy is appropriate in your spider classes. All the examples in this README assume Spidey::Strategies::Mongo; a variant using the other strategies is sketched after the example below.

Example spider class

class EbaySpider < Spidey::AbstractSpider
  include Spidey::Strategies::Mongo

  handle "http://www.ebay.com", :process_home

  def process_home(page, default_data = {})
    # ...
  end
end
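
If you target the Mongo Ruby Driver 2.x or moped instead, the only change is the module you include; handlers and processing methods stay the same (a sketch carried over from the example above):

class EbaySpider < Spidey::AbstractSpider
  include Spidey::Strategies::Mongo2  # or Spidey::Strategies::Moped for the moped driver

  # handlers and processing methods are unchanged
end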

Invocation

The spider's constructor accepts new parameters for each of the MongoDB collections to employ: url_collection, result_collection, and error_collection.

db = Mongo::Connection.new['example']

spider = EbaySpider.new(
  url_collection: db['urls'],
  result_collection: db['results'],
  error_collection: db['errors'])
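
With Spidey::Strategies::Mongo2 and the Mongo Ruby Driver 2.x, the equivalent setup presumably uses a Mongo::Client (a sketch; the host and collection names are illustrative):

client = Mongo::Client.new(['127.0.0.1:27017'], database: 'example')

spider = EbaySpider.new(
  url_collection: client[:urls],
  result_collection: client[:results],
  error_collection: client[:errors])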

With persistent storage of the URL-crawling queue, it's now possible to stop crawling and resume at a later point. The crawl method accepts a new optional crawl_for parameter specifying the number of seconds after which to stop.

spider.crawl crawl_for: 600  # seconds, or more conveniently (w/ActiveSupport): 10.minutes

(The base implementation's max_urls parameter is also useful for this purpose.)
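
Both limits can be combined in a single call (a sketch assuming max_urls is, like crawl_for, passed as an option to crawl):

spider.crawl crawl_for: 600, max_urls: 1_000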

Recording Results

By default, invocations of record(data) by the spider simply insert new documents into the result collection. If corresponding results may already exist in the collection and should instead be updated, define a result_key method that returns a key by which to find the corresponding document. The method is called with a hash of the data being recorded:

class EbaySpider < Spidey::AbstractSpider
  include Spidey::Strategies::Mongo

  def result_key(data)
    data[:detail_url]
  end

  # ...
end

This performs an upsert instead of the usual insert (i.e., an update if a result document matching the key already exists, or insert otherwise).
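
For illustration, a processing method that records the same detail URL on a later crawl will then update the existing document rather than insert a duplicate (the method name and fields here are hypothetical; page is assumed to be a Mechanize page, as in Spidey's handlers):

def process_detail(page, default_data = {})
  # detail_url matches result_key, so re-crawling updates the existing result document
  record default_data.merge(detail_url: page.uri.to_s, title: page.title)
end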

Contributing

Please contribute! See CONTRIBUTING for details.

Copyright (c) 2012-2015 Joey Aghion, Artsy Inc., and Contributors.

See LICENSE.txt for further details.
