CrawlerDetect is a library to detect bots/crawlers via the user agent
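A minimal usage sketch, assuming the gem exposes a module-level CrawlerDetect.is_crawler? helper that takes a raw user agent string (check the gem's README for the exact entry point):

```ruby
require 'crawler_detect'

# Assumption: CrawlerDetect.is_crawler? accepts a user agent string
# and returns true/false; the exact API may differ.
ua = "Googlebot/2.1 (+http://www.google.com/bot.html)"
puts CrawlerDetect.is_crawler?(ua) ? "bot/crawler" : "regular browser"
```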
BFS web crawler that implements Observable
Generic Web crawler with a DSL that parses structured data from web pages
Voight-Kampff detects bots, spiders, crawlers and replicants
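A hedged example, assuming the gem's VoightKampff.bot? helper (and, in Rails, a request.bot? extension) as described in its documentation:

```ruby
require 'voight_kampff'

# Assumption: VoightKampff.bot? takes a user agent string and returns
# true for known bots, spiders and crawlers.
ua = "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)"
puts VoightKampff.bot?(ua)  # => true for a known crawler
# In a Rails controller the same check is typically written as request.bot?
```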
Cobweb is a web crawler that can use Resque to distribute crawls across a cluster, crawling extremely large sites much faster than a multi-threaded crawler. It can also run as a standalone crawler and has a sophisticated statistics interface for monitoring the progress of crawls.
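As a hedged sketch of the general idea (plain Resque job classes, not Cobweb's actual workers), distributing a crawl over a Resque queue could look like this:

```ruby
require 'resque'

# Illustration of Resque-distributed crawling, not Cobweb's API:
# each discovered URL becomes a job, so any number of workers can
# pull pages off the :crawl queue in parallel.
class CrawlPageJob
  @queue = :crawl

  def self.perform(url)
    # Fetch and process the page here, then enqueue newly found links.
    puts "crawling #{url}"
  end
end

Resque.enqueue(CrawlPageJob, "https://example.com/")
# Workers started with `rake resque:work QUEUE=crawl` then share the crawl.
```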
An easy-to-use distributed web crawler framework based on Redis
validate-website is a web crawler for checking markup validity against XML Schema / DTD and reporting not-found URLs.
Asynchronous web crawler, scraper and file harvester
Like Nightcrawler of the X-Men, this gem teleports your assets to an OpenStack Swift bucket/container
is_crawler does exactly what you might think it does: determines whether the supplied string matches a known crawler or bot.
A crawler toolkit
Iudex is a general purpose web crawler and feed processor in ruby/java. The iudex-da gem provides a PostgreSQL-based content meta-data store and work priority queue.
Rack middleware adhering to the Google AJAX Crawling Scheme, using a headless browser to render JS-heavy pages and serve a DOM snapshot of the rendered state to the requesting search engine.
Crawls public LinkedIn profiles via Google
Ruby web crawler using PhantomJS
Arachnid is a web crawler that relies on Bloom filters to efficiently track visited URLs and on Typhoeus to avoid the overhead of Mechanize when crawling every page on a domain.
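The underlying approach can be sketched independently of Arachnid's own API: keep visited URLs in a Bloom filter (here via the bloomfilter-rb gem, an assumption about available tooling) and fetch pages with Typhoeus.

```ruby
require 'typhoeus'
require 'bloomfilter-rb'

# Illustrative sketch of the Bloom-filter-plus-Typhoeus approach, not
# Arachnid's actual API. Filter parameters are placeholder values.
visited = BloomFilter::Native.new(size: 1_000_000, hashes: 5, seed: 1, bucket: 1, raise: false)
queue   = ["https://example.com/"]

until queue.empty?
  url = queue.shift
  next if visited.include?(url)
  visited.insert(url)

  response = Typhoeus.get(url, followlocation: true)
  next unless response.success?

  # Naive link extraction; a real crawler would use an HTML parser
  # and restrict itself to the target domain.
  response.body.scan(/href="(https?:\/\/[^"]+)"/).flatten.each do |link|
    queue << link unless visited.include?(link)
  end
end
```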
Iudex is a general purpose web crawler and feed processor in ruby/java. The iudex-core gem contains core facilities and notably, does not contain such facilities as database-backed state management.
Rails Analyzer Tools contains Bench, a simple web page benchmarker, Crawler, a tool for beating up on web sites, RailsStat, a tool for monitoring Rails web sites, and IOTail, a tail(1) method for Ruby IOs.
Website crawler and fulltext indexer.
JavaScript-enabled web crawler kit
Iudex is a general purpose web crawler and feed processor in ruby/java. The iudex-html gem contains filters for HTML parsing, filtering, and extracting text and links.
Iudex is a general purpose web crawler and feed processor in ruby/java. The iudex-worker gem provides a worker daemon for feed/page processing.
Iudex is a general purpose web crawler and feed processor in ruby/java. The iudex-http gem contains an HTTP-client-agnostic abstraction layer.
Iudex is a general purpose web crawler and feed processor in ruby/java. The iudex-simhash gem contains support for generating and searching over simhash fingerprints.
Iudex is a general purpose web crawler and feed processor in ruby/java. The iudex-barc gem contains support for the BARC (Basic ARChive) format.
Iudex is a general purpose web crawler and feed processor in ruby/java. This gem is an rjack-httpclient-3 based implementation of the iudex-http interfaces.
Crawls Instagram photos, posts and videos for download.
Iudex is a general purpose web crawler and feed processor in ruby/java. This gem is a Jetty HTTP Client based implementation of the iudex-http interfaces.
A simple, fast web crawler
Iudex is a general purpose web crawler and feed processor in ruby/java. The iudex-filter gem contains a fundamental filtering/chain-of-responsibility sub-system.
Iudex is a general purpose web crawler and feed processor in ruby/java. The iudex-rome gem is an adaptation of rjack-rome for feed parsing in Iudex.
Shows crawled data from DMM and DMM.R18, e.g. rankings
Crawls Twitter
The SimpleCrawler module is a library for crawling web sites. The crawler provides comprehensive data from each crawled page, which can be used for page analysis, indexing, accessibility checks, etc. Restrictions can be specified to limit crawling of binary files.
Ruby Cheerio is a jQuery-style HTML parser that takes selectors as input. It is a Ruby version of the Node.js package 'Cheerio', which is extensively used by crawlers. Please visit the home page for usage details.
This is a crawler framework.
Cangrejo lets you consume crabfarm crawlers using a simple DSL
Crawls Indeed resumes
Iudex is a general purpose web crawler and feed processor in ruby/java. The iudex-http-test gem contains an HTTP test server for testing HTTP client implementations.
Crawl websites
Post URLs to the Wayback Machine (Internet Archive), using a crawler, from Sitemap(s) or a list of URLs.
A web crawler written in Ruby
RegexpCrawler is a Ruby library for crawling data from websites using regular expressions.
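The technique it describes, illustrated in plain Ruby rather than through RegexpCrawler's own interface:

```ruby
require 'net/http'
require 'uri'

# Illustration of regex-based extraction, not RegexpCrawler's API:
# fetch a page and pull data out with capture groups.
html = Net::HTTP.get(URI("https://example.com/"))

# Hypothetical patterns: collect headings and link targets.
titles = html.scan(%r{<h1[^>]*>(.*?)</h1>}m).flatten.map(&:strip)
links  = html.scan(/href="([^"]+)"/).flatten

puts titles
puts links.first(10)
```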
SemanticCrawler is a Ruby library that encapsulates data gathering from different sources. Currently supported are microdata from websites, country information from Freebase, Factbook and FAO (Food and Agriculture Organization of the United Nations), crisis information from GDACS.org, and geo data from LinkedGeoData. Additionally, the GeoNames module allows retrieving Factbook and FAO country information from GPS coordinates.
Iudex is a general purpose web crawler and feed processor in ruby/java. This gem is an rjack-async-httpclient based implementation of the iudex-http interfaces.
Iudex is a general purpose web crawler and feed processor in ruby/java. The iudex-char-detector gem provides charset detection support.
webget gem - a web (go get) crawler incl. web cache
Automatically protects your staging app from web crawlers and casual visitors.
Web crawler that helps you parse and collect data from the web
Grabs repository information from GitHub, RubyGems, The Ruby Toolbox and Stack Overflow