= Rcrawl, a web crawler for Ruby

Rcrawl is a web crawler written entirely in Ruby. It is currently limited in that it will stay on the original domain provided.

Rcrawl uses a modular approach to processing the HTML it receives. The link extraction portion of Rcrawl depends on the scrAPI toolkit by Assaf Arkin (http://labnotes.org):

  gem install scrapi
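For a sense of what that extraction layer looks like, here is a minimal scrAPI scraper that collects the href attribute of every anchor tag. This is an illustrative sketch based on scrAPI's documented selector API, not Rcrawl's actual extractor:

  require 'rubygems'
  require 'scrapi'

  # Collect the href attribute of each anchor tag in a document.
  link_scraper = Scraper.define do
    array :urls
    process "a[href]", :urls => "@href"
    result :urls
  end

  html = '<html><body><a href="/about">About</a></body></html>'
  puts link_scraper.scrape(html)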

The structure of the crawling process was inspired by the specs of the Mercator crawler (http://www.cindoc.csic.es/cybermetrics/pdf/68.pdf). A (somewhat) brief overview of the design philosophy follows.

== The Rcrawl process

  1. Remove an absolute URL from the URL Server.
  2. Download the corresponding document from the internet, grabbing and processing robots.txt first, if available.
  3. Feed the document into a rewind input stream (RIS) to be read/re-read as needed. Based on MIME type, invoke the process method of the processing module associated with that MIME type: for example, a link extractor or tag counter module for text/html MIME types, or a gif stats module for image/gif. By default, all text/html MIME types will pass through the link extractor. Each link will be converted to an absolute URL and tested against an (ideally user-supplied) URL filter to determine if it should be downloaded.
  4. If the URL passes the filter (currently hard-coded as "same domain?"), call the URL-seen? test.
  5. Has the URL been seen before? Namely, is it in the URL Server or has it been downloaded already? If the URL is new, it is added to the URL Server.
  6. Return to step 1 and repeat until the URL Server is empty. (A sketch of this loop appears below.)
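The following is a minimal sketch of that loop in plain Ruby, to make the flow concrete. The in-memory queue and Set, the regexp-based link extraction, and the omission of robots.txt handling and the rewind input stream are all simplifications; this is not Rcrawl's implementation.

  require 'set'
  require 'uri'
  require 'net/http'

  # Mercator-style crawl loop (simplified): same-domain filter,
  # URL-seen? test backed by a Set, crude regexp link extraction.
  def crawl(start_url)
    url_server = [start_url]           # queue of absolute URLs to fetch
    seen       = Set.new([start_url])  # backs the URL-seen? test
    domain     = URI(start_url).host

    until url_server.empty?
      url = url_server.shift                         # 1. remove a URL
      begin
        body = Net::HTTP.get(URI(url))               # 2. download the document
      rescue StandardError
        next
      end

      # 3. dispatch on MIME type; only text/html is handled here
      body.scan(/href="([^"]+)"/).flatten.each do |link|
        abs = URI.join(url, link).to_s rescue next   # absolutize each link
        next unless URI(abs).host == domain          # 4. URL filter: same domain?
        next if seen.include?(abs)                   # 5. URL-seen? test
        seen << abs                                  # new URL is added to
        url_server << abs                            # the URL Server
      end
    end

    seen.to_a
  end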

== Examples

Instantiate a new Rcrawl crawler object:

  crawler = Rcrawl::Crawler.new(url)

Begin the crawl process:

  crawler.crawl

== After the crawler is done crawling

Return the URLs visited during the crawl:

  crawler.visited_links

Return a hash where the key is a URL and the value is the raw HTML from that URL:

  crawler.raw_html

Return a hash where the key is a URL and the value is the error message from stderr:

  crawler.errors

Return the off-domain links that were found but not followed:

  crawler.external_links
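Once a crawl has finished, those accessors can be walked directly. A brief usage sketch, assuming the return values described above:

  crawler.raw_html.each do |url, html|
    puts "#{url}: #{html.length} bytes"
  end

  crawler.errors.each do |url, message|
    warn "#{url}: #{message}"
  end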

Set the user agent:

  crawler.user_agent = "Your fancy crawler name here"

== License

Copyright © 2006 Digital Duckies, LLC, under the MIT License.

Developed for http://digitalduckies.net

News, code, and documentation at http://blog.digitalduckies.net
