A crawler extension package written on top of HTTP by a junior developer. Please don't download it casually; it has a lot of pitfalls.
A simple directory crawler DSL.
Crawls job listing websites for jobs requiring security clearance.
Cosmicrawler is a crawler library for Ruby. It provides scalable asynchronous crawling of (http|file|etc) sources using EventMachine.
A simple web crawler for Ruby.
Driller is a command-line Ruby web crawler built on Anemone. Driller can crawl a website, report error pages and slow pages, and generate HTML reports.
Grabs movie information from atmovies.com.
This rubygem does not have a description or summary.
Email crawler: crawls the top ten Google search results looking for email addresses and exports them to CSV.
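As a rough sketch of that kind of workflow (not this gem's actual API), the following stands alone on the standard library: it scans a handful of result pages for e-mail addresses and writes them to CSV. The URL list and the EMAIL_RE pattern are illustrative assumptions.

  require 'net/http'
  require 'uri'
  require 'csv'

  # Simplified e-mail pattern; real-world matching needs more care.
  EMAIL_RE = /[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/

  urls = %w[https://example.com/result1 https://example.com/result2]

  emails = urls.flat_map do |url|
    body = Net::HTTP.get(URI(url))   # fetch each result page
    body.scan(EMAIL_RE)              # collect every address found in the body
  end.uniq

  CSV.open('emails.csv', 'w') do |csv|
    emails.each { |email| csv << [email] }
  end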
Generic Web crawler with a DSL that parses event-related data from web pages
Dead simple yet powerful Ruby crawler for easy parallel crawling, with support for anonymity.
This rubygem does not have a description or summary.
RegexpCrawler is a Ruby library for crawling data from websites using regular expressions.
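A minimal illustration of regex-based extraction in plain Ruby (not RegexpCrawler's own API): fetch a page and pull out link/text pairs with a capturing pattern. The URL and the pattern are assumptions made for the example.

  require 'net/http'
  require 'uri'

  html = Net::HTTP.get(URI('https://example.com/list'))

  # Each capture group becomes a block argument.
  html.scan(%r{<a href="([^"]+)">([^<]+)</a>}) do |href, text|
    puts "#{text} => #{href}"
  end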
FileCrawler searches and controls files in a local directory.
Simple little website crawler.
Bulbasaur is a helper for crawler operations used in Pread.ly
Gem for crawling data from external sources
Ruby web crawler to access Omelete information.
A little website crawler.
A flexible, modular web crawler
Botch is a DSL for quickly creating web crawlers. Inspired by Sinatra.
An easy way to let the AdSense crawler log in and see private or custom pages in your Rails application. Basically one custom login filter. The gem lets you slightly increase revenue from Google AdSense/AdWords: it enables crawling of private pages, so you get better-targeted ads even on pages behind the login screen.
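A hand-rolled sketch of the idea (not this gem's actual filter): let requests from the AdSense crawler, which identifies itself as "Mediapartners-Google", pass the login check so it can index otherwise private pages. The current_user and login_path names are illustrative stand-ins for whatever authentication the app already uses.

  class ApplicationController < ActionController::Base
    before_action :require_login

    private

    def require_login
      return if adsense_crawler?              # the crawler sees the page without a session
      redirect_to login_path unless current_user
    end

    def adsense_crawler?
      request.user_agent.to_s.include?('Mediapartners-Google')
    end
  end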
This gem helps Crawler Writers to interact with the PromoQui REST API
Add support for ElasticSearch in Polipus crawler
A simple solution to provide on-demand service access (e.g. port 80 on webserver), where a more robust and secure VPN solution is not available. Essentially, it is a more user-friendly form of "port knocking". The original proof-of-concept implementation was run for almost three years by Demotix, to protect development and staging servers from search engine crawlers and other unwanted traffic.
your friendly neighborhood web crawler
Allows your Rails application to be spiderable by crawlers.
Server browser and Crawler for many games (L4D2, TF2, CS:S, KZMOD, The Ship)
Checks a user agent for a web crawler
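A simple stand-in for this kind of check (not the gem's implementation): match the User-Agent string against a list of well-known bot signatures. The signature list is an arbitrary example.

  CRAWLER_RE = /\b(googlebot|bingbot|slurp|duckduckbot|baiduspider|yandexbot)\b/i

  def crawler?(user_agent)
    !!(user_agent =~ CRAWLER_RE)
  end

  crawler?('Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)')  # => true
  crawler?('Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) Safari/605.1.15')           # => false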
A very simple crawler for RubyGems.org used to demo the power of ElasticSearch at RubyConf 2013
Minimal sharding solution for AR
== Medusa: a ruby crawler framework

{rdoc-image:https://badge.fury.io/rb/medusa-crawler.svg}[https://rubygems.org/gems/medusa-crawler]
rdoc-image:https://github.com/brutuscat/medusa-crawler/workflows/Ruby/badge.svg?event=push

Medusa is a framework for the ruby language to crawl and collect useful information about the pages it visits. It is versatile, allowing you to write your own specialized tasks quickly and easily.

=== Features

* Choose the links to follow on each page with +focus_crawl+
* Multi-threaded design for high performance
* Tracks +301+ HTTP redirects
* Allows exclusion of URLs based on regular expressions
* Records response time for each page
* Obeys _robots.txt_ directives (optional, but recommended)
* In-memory or persistent storage of pages during crawl, provided by Moneta[https://github.com/moneta-rb/moneta]
* Inherits OpenURI behavior (redirects, automatic charset and encoding detection, proxy configuration options).

<b>Do you have an idea or a suggestion? {Open an issue and talk about it}[https://github.com/brutuscat/medusa-crawler/issues/new]</b>

=== Examples

Medusa is versatile and meant to be used programmatically. You can start with one or multiple URIs:

  require 'medusa'

  Medusa.crawl('https://www.example.com', depth_limit: 2)

Or you can pass a block and it will yield the crawler back, to manage configuration or drive its crawling focus:

  require 'medusa'

  Medusa.crawl('https://www.example.com', depth_limit: 2) do |crawler|
    crawler.discard_page_bodies = some_flag

    # Persist all the pages state across crawl-runs.
    crawler.clear_on_startup = false
    crawler.storage = Medusa::Storage.Moneta(:Redis, 'redis://redis.host.name:6379/0')

    crawler.skip_links_like(/private/)

    crawler.on_pages_like(/public/) do |page|
      logger.debug "[public page] #{page.url} took #{page.response_time} found #{page.links.count}"
    end

    # Use an arbitrary logic, page by page, to continue customizing the crawling.
    crawler.focus_crawl(/public/) do |page|
      page.links.first
    end
  end
An easy to use distributed web-crawler framework based on Redis
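One way to picture a Redis-backed distributed crawl (a sketch under assumptions, not this framework's API): workers on any machine pop URLs from a shared Redis list, fetch them, and push newly discovered links back. The key names and REDIS_URL default are made up for the example.

  require 'redis'
  require 'net/http'
  require 'uri'

  redis = Redis.new(url: ENV.fetch('REDIS_URL', 'redis://localhost:6379/0'))

  loop do
    _, url = redis.brpop('crawl:queue')            # block until a URL is available
    next if redis.sismember('crawl:seen', url)     # skip URLs another worker handled
    redis.sadd('crawl:seen', url)

    body = Net::HTTP.get(URI(url))
    body.scan(/href="(https?:[^"]+)"/).flatten.each do |link|
      redis.lpush('crawl:queue', link)             # feed discovered links back in
    end
  end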
Web page crawler.
Fetch books metadata using Amazon Product Advertising API
Rack middleware that executes javascript before serving pages to crawlers.
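A bare-bones sketch of the pattern (not this gem's code): a Rack middleware that, for requests coming from crawlers, serves a pre-rendered HTML snapshot instead of the JavaScript-heavy page. The render_with_headless_browser method is a hypothetical hook standing in for whatever actually executes the JavaScript.

  class PrerenderForCrawlers
    BOT_RE = /googlebot|bingbot|yandex|baiduspider/i

    def initialize(app)
      @app = app
    end

    def call(env)
      if env['HTTP_USER_AGENT'] =~ BOT_RE
        html = render_with_headless_browser(env)    # assumption: JS executed elsewhere
        [200, { 'Content-Type' => 'text/html' }, [html]]
      else
        @app.call(env)                              # normal users get the untouched app
      end
    end

    private

    def render_with_headless_browser(env)
      "<html><body>snapshot of #{env['PATH_INFO']}</body></html>"  # placeholder snapshot
    end
  end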
This rubygem does not have a description or summary.
A gem that collects mangas from websites
This gem provides a Ruby API for scraping HTML pages from the Matricula Web system and returning content that can be more easily processed by a program.
Simple async HTTP crawler based on em-synchrony
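For a general feel of async HTTP crawling on em-synchrony (a sketch, not this particular gem's API), the example below fetches a few pages with bounded concurrency. It needs the em-synchrony and em-http-request gems; the URL list and concurrency level are arbitrary.

  require 'em-synchrony'
  require 'em-synchrony/em-http'

  urls = %w[https://example.com/a https://example.com/b https://example.com/c]

  EM.synchrony do
    # Fetch up to 2 pages concurrently; each fiber blocks only itself.
    EM::Synchrony::FiberIterator.new(urls, 2).each do |url|
      http = EM::HttpRequest.new(url).get
      puts "#{url}: #{http.response.bytesize} bytes"
    end
    EM.stop
  end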
Crawler Engine provides the ability to crawl all news from a customized website.
A periodic crawler that fetches the latest CVE additions, parses them, and filters them
livedoor-feeddiscover performs feed autodiscovery using the livedoor Feed Discover API. The livedoor Feed Discover API finds Atom/RSS feed(s) in the livedoor Reader crawler database, so livedoor-feeddiscover does not access the target URL.
A generic web crawler that doesn't crawl outside URLs.
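A minimal same-host crawl loop illustrating the "don't leave the start domain" rule (not this particular gem's code). It uses only the standard library; extracting links with a crude regex is an intentional simplification.

  require 'net/http'
  require 'uri'
  require 'set'

  start = URI('https://example.com/')
  queue, seen = [start], Set.new

  until queue.empty?
    url = queue.shift
    next if seen.include?(url)
    seen << url

    body = Net::HTTP.get(url) rescue next        # skip pages that fail to fetch
    body.scan(/href="([^"]+)"/).flatten.each do |href|
      link = URI.join(url, href) rescue next     # ignore malformed links
      queue << link if link.host == start.host   # never follow off-site URLs
    end
  end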
Pantopoda is a fast and effective web crawler that visits all links on a given domain.
Uses Paperclip to generate images from sensitive attributes such as e-mail addresses and telephone numbers, in order to reduce crawlers' success.
This rubygem does not have a description or summary.
This rubygem does not have a description or summary.
A web crawler to fetch a recipe from Marmiton.
Crawls the Senegalese web looking for jobs, using the excellent wombat gem.