== Medusa: a ruby crawler framework {rdoc-image:https://badge.fury.io/rb/medusa-crawler.svg}[https://rubygems.org/gems/medusa-crawler] rdoc-image:https://github.com/brutuscat/medusa-crawler/workflows/Ruby/badge.svg?event=push
Medusa is a framework for the ruby language to crawl and collect useful information about the pages
it visits. It is versatile, allowing you to write your own specialized tasks quickly and easily.
=== Features
- Choose the links to follow on each page with +focus_crawl+
- Multi-threaded design for high performance
- Tracks +301+ HTTP redirects
- Allows exclusion of URLs based on regular expressions
- Records response time for each page
- Obey robots.txt directives (optional, but recommended)
- In-memory or persistent storage of pages during crawl, provided by Moneta[https://github.com/moneta-rb/moneta]
- Inherits OpenURI behavior (redirects, automatic charset and encoding detection, proxy configuration options).
Do you have an idea or a suggestion? {Open an issue and talk about it}[https://github.com/brutuscat/medusa-crawler/issues/new]
=== Examples
Medusa is versatile and to be used programatically, you can start with one or multiple URIs:
require 'medusa'
Medusa.crawl('https://www.example.com', depth_limit: 2)
Or you can pass a block and it will yield the crawler back, to manage configuration or drive its crawling focus:
require 'medusa'
Medusa.crawl('https://www.example.com', depth_limit: 2) do |crawler|
crawler.discard_page_bodies = some_flag
# Persist all the pages state across crawl-runs.
crawler.clear_on_startup = false
crawler.storage = Medusa::Storage.Moneta(:Redis, 'redis://redis.host.name:6379/0')
crawler.skip_links_like(/private/)
crawler.on_pages_like(/public/) do |page|
logger.debug "[public page] #{page.url} took #{page.response_time} found #{page.links.count}"
end
# Use an arbitrary logic, page by page, to continue customize the crawling.
crawler.focus_crawl(/public/) do |page|
page.links.first
end
end
=== Requirements
moneta:: for the key/value storage adapters
nokogiri:: for parsing the HTML of webpages
robotex:: for support of the robots.txt directives
=== Development
To test and develop this gem, additional requirements are:
=== About
Medusa is a revamped version of the defunk anemone gem.
=== License
Copyright (c) 2009 Vertive, Inc.
Copyright (c) 2020 Mauro Asprea
Released under the {MIT License}[https://github.com/brutuscat/medusa-crawler/blob/master/LICENSE.txt]