
robotstxt-parser


0.1.1
Rubygems

Version published: 10 years ago

Maintainers: 1

Created: 10 years ago


= Robotstxt

Robotstxt is a Ruby robots.txt file parser.

The robots.txt exclusion protocol is a simple mechanism whereby site-owners can guide any automated crawlers to relevant parts of their site, and prevent them accessing content which is intended only for other eyes. For more information, see http://www.robotstxt.org/.

This library provides mechanisms for obtaining and parsing the robots.txt file from websites. As there is no official "standard" it tries to do something sensible, though inspiration was taken from:

While the parsing semantics of this library are explained below, you should not write sitemaps that depend on all robots acting the same -- they simply won't. Even the various Ruby libraries support very different subsets of functionality.

This gem builds on the work of Simone Rinzivillo, and is released under the MIT license -- see the LICENSE file.

== Usage

There are two public points of interest: the Robotstxt module and the Robotstxt::Parser class.

The Robotstxt module has three public methods:

- Robotstxt.get source, user_agent, (options)
    Returns a Robotstxt::Parser for the robots.txt obtained from source.
- Robotstxt.parse robots_txt, user_agent
    Returns a Robotstxt::Parser for the robots.txt passed in.
- Robotstxt.get_allowed? urlish, user_agent, (options)
    Returns true iff the robots.txt obtained from the host identified by the
    urlish allows the given user agent access to the url.

The Robotstxt::Parser class contains two pieces of state, the user_agent and the text of the robots.txt. In addition its instances have two public methods:

- Robotstxt::Parser#allowed? urlish
    Returns true iff the robots.txt file allows this user_agent access to that url.
- Robotstxt::Parser#sitemaps
    Returns a list of the sitemaps listed in the robots.txt file.

The methods above take five kinds of parameter:

A "urlish" is either a String that represents a URL (suitable for passing to
URI.parse) or a URI object, i.e.

  urlish = "http://www.example.com/"
  urlish = "/index.html"
  urlish = "https://compicat.ed/home?action=fire#joking"
  urlish = URI.parse("http://example.co.uk")

A "source" is either a "urlish", or a Net::HTTP connection. This allows the
library to re-use the same connection when the server respects Keep-alive:
headers, i.e.

  source = Net::HTTP.new("example.com", 80)
  Net::HTTP.start("example.co.uk", 80) do |http|
    source = http
  end
  source = "http://www.example.com/index.html"

When a "urlish" is provided, only the host and port sections are used, and
the path is forced to "/robots.txt".
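
As a rough illustration of that rule, the sketch below derives a robots.txt URL from a "urlish" using only the standard uri library. The helper name robots_txt_url is hypothetical and is not part of this gem's API; the gem's own implementation may differ.

```ruby
require 'uri'

# Hypothetical helper (not part of the gem): keep the scheme, host and
# port of a urlish and force the path to /robots.txt.
def robots_txt_url(urlish)
  uri = urlish.is_a?(URI::Generic) ? urlish : URI.parse(urlish)
  raise ArgumentError, "no host in #{urlish.inspect}" unless uri.host

  robots = uri.dup
  robots.path = '/robots.txt'
  robots.query = nil      # the query string is discarded
  robots.fragment = nil   # so is the fragment
  robots
end

puts robots_txt_url('http://www.example.com/index.html?x=1#top')
puts robots_txt_url(URI.parse('http://example.co.uk:8080/a/b'))
```

A relative urlish such as "/index.html" carries no host, so this sketch raises an ArgumentError for it rather than guessing one.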

A "robots_txt" is the textual content of a robots.txt file that is in the
same encoding as the urls you will be fetching (normally utf8).

A "user_agent" is the string value you use in your User-agent: header.

The "options" is an optional hash containing
  :num_redirects (5) - the number of redirects to follow before giving up.
  :http_timeout (10) - the length of time in seconds to wait for one http
                       request
  :url_charset (utf8) - the charset which you will use to encode your urls.

I recommend not passing the options unless you have to.

== Examples

  url = "http://example.com/index.html"
  if Robotstxt.get_allowed?(url, "Crawler")
    open(url)
  end

  Net::HTTP.start("example.co.uk") do |http|
    robots = Robotstxt.get(http, "Crawler")

    if robots.allowed? "/index.html"
      http.get("/index.html")
    elsif robots.allowed? "/index.php"
      http.get("/index.php")
    end
  end

== Details

=== Request level

This library handles different HTTP status codes according to the specifications on robotstxt.org, in particular:

If an HTTPUnauthorized or an HTTPForbidden is returned when trying to access /robots.txt, then the entire site should be considered "Disallowed".

If an HTTPRedirection is returned, it should be followed (though we give up after five redirects, to avoid infinite loops).

If an HTTPSuccess is returned, the body is converted into utf8, and then parsed.

Any other response, or no response, indicates that there are no Disallowed urls on the site.
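
The four cases above can be summarised with a small dispatch on the Net::HTTP response classes. This is only a sketch of the decision table: the helper name robots_policy_for and the symbols it returns are assumptions made for illustration, not the gem's internals.

```ruby
require 'net/http'

# Hypothetical helper: map the response to a /robots.txt request
# (or nil when there was no response) to a crawl policy.
def robots_policy_for(response)
  case response
  when Net::HTTPUnauthorized, Net::HTTPForbidden
    :disallow_all      # 401/403: treat the whole site as Disallowed
  when Net::HTTPRedirection
    :follow_redirect   # 3xx: caller follows it, up to its redirect limit
  when Net::HTTPSuccess
    :parse_body        # 2xx: convert the body to utf8 and parse it
  else
    :allow_all         # anything else, or no response at all
  end
end

forbidden = Net::HTTPForbidden.new('1.1', '403', 'Forbidden')
puts robots_policy_for(forbidden)
puts robots_policy_for(nil)
```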

=== User-agent matching

This is case-insensitive, substring matching, i.e. equivalent to matching the user agent with /.*thing.*/i.

Additionally, * characters are interpreted as meaning any number of any character (in regular expression idiom: /.*/). Google implies that it does this, at least for trailing *s, and the standard implies that "*" is a special user agent meaning "everything not referred to so far".

There can be multiple User-agent: lines for each section of Allow: and Disallow: lines in the robots.txt file:

  User-agent: Google
  User-agent: Bing
  Disallow: /secret

In cases like this, all user-agents inherit the same set of rules.
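
The user-agent matching rule in this section (case-insensitive substring matching, with * standing for any run of characters) could be sketched as follows. The helper name user_agent_matches? is hypothetical, and the gem may build its regular expressions differently.

```ruby
# Hypothetical helper: does a User-agent: value from robots.txt apply
# to the user agent string we send?
def user_agent_matches?(pattern, user_agent)
  # Escape each literal piece, then join the pieces with ".*" so that
  # every "*" in the pattern matches any number of any character.
  body = pattern.split('*', -1).map { |part| Regexp.escape(part) }.join('.*')
  # No anchors: the pattern may match anywhere inside the user agent.
  !!(Regexp.new(body, Regexp::IGNORECASE) =~ user_agent)
end

puts user_agent_matches?('google', 'Mozilla/5.0 (compatible; Googlebot/2.1)')
puts user_agent_matches?('Bing', 'Googlebot')
```

With this rule a section headed "User-agent: *" applies to every crawler, since the pattern compiles to /.*/i.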

=== Path matching

This is case-sensitive prefix matching, i.e. equivalent to matching the requested path (or path + '?' + query) against /^thing.*/. As with user-agents, * is interpreted as any number of any character.

Additionally, when the pattern ends with a $, it forces the pattern to match the entire path (or path + ? + query).
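
A minimal sketch of this path matching, assuming * is a wildcard for any run of characters and a trailing $ anchors the pattern to the end of the path. The helper names are hypothetical and not taken from the gem.

```ruby
# Hypothetical helper: compile a robots.txt path pattern into a Regexp.
def path_pattern_to_regexp(pattern)
  anchored = pattern.end_with?('$')
  pattern = pattern[0...-1] if anchored
  # Literal pieces are escaped; each "*" becomes ".*".
  body = pattern.split('*', -1).map { |part| Regexp.escape(part) }.join('.*')
  # Always anchored at the start (prefix match); anchored at the end
  # only when the pattern finished with "$".
  Regexp.new('\A' + body + (anchored ? '\z' : ''))
end

# Hypothetical helper: case-sensitive match of a path (or path + '?' + query).
def path_matches?(pattern, path)
  !!(path_pattern_to_regexp(pattern) =~ path)
end

puts path_matches?('/secret', '/secret/page.html')
puts path_matches?('/*.php$', '/index.php?x=1')
```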

In order to get consistent results, before the globs are matched, the %-encoding is normalised so that only /?&= remain %-encoded. For example, /h%65llo/ is the same as /hello/, but /ac%2fdc is not the same as /ac/dc - this is due to the significance granted to the / operator in urls.
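
The normalisation can be approximated as below. This sketch makes a simplifying assumption that is not stated in this README: only ASCII escapes are decoded, and escapes for bytes outside ASCII are left encoded. The helper name and constant are hypothetical.

```ruby
# Characters whose %-encoded form is kept as-is.
PRESERVED_CHARS = '/?&='

# Hypothetical helper: decode %XX escapes except for / ? & and =.
# Simplification: non-ASCII bytes (>= 0x80) stay %-encoded.
def normalize_percent_encoding(path)
  path.gsub(/%([0-9a-fA-F]{2})/) do
    hex = Regexp.last_match(1)
    byte = hex.hex
    if byte < 0x80 && !PRESERVED_CHARS.include?(byte.chr)
      byte.chr              # safe to decode
    else
      "%#{hex.upcase}"      # keep encoded, with upper-case hex digits
    end
  end
end

puts normalize_percent_encoding('/h%65llo/')
puts normalize_percent_encoding('/ac%2fdc')
```

Both the rule's path and the requested path would go through the same function before being compared, so that equivalent spellings of a url compare equal.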

The paths of the first section that matched our user-agent (by order of appearance in the file) are parsed in order of appearance. The first Allow: or Disallow: rule that matches the url is accepted. This is prescribed by robotstxt.org, but other parsers take wildly different strategies:

- Google checks all Allows: then all Disallows:
- Bing checks the most-specific first
- Others check all Disallows: then all Allows:

As is conventional, a "Disallow: " line with no path given is treated as "Allow: *", and if a URL didn't match any path specifiers (or the user-agent didn't match any user-agent sections) then that is implicit permission to crawl.
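
The first-match-wins evaluation, together with the empty Disallow: convention and the implicit permission, might look like the sketch below. The representation of rules as [directive, pattern] pairs and both helper names are assumptions made for illustration only.

```ruby
# Hypothetical helper: prefix match with "*" wildcards and optional
# trailing "$" anchor, as described under "Path matching".
def glob_prefix_match?(pattern, path)
  anchored = pattern.end_with?('$')
  pattern = pattern[0...-1] if anchored
  body = pattern.split('*', -1).map { |part| Regexp.escape(part) }.join('.*')
  !!(Regexp.new('\A' + body + (anchored ? '\z' : '')) =~ path)
end

# Hypothetical helper: rules is an ordered list of [directive, pattern]
# pairs taken from the first section that matched our user agent.
def allowed_by_rules?(rules, path)
  rules.each do |directive, pattern|
    # A "Disallow:" with no path is treated as "Allow: *".
    if directive == :disallow && pattern.empty?
      directive = :allow
      pattern = '*'
    end
    # The first rule that matches decides the outcome.
    return directive == :allow if glob_prefix_match?(pattern, path)
  end
  true # nothing matched: implicit permission to crawl
end

rules = [[:allow, '/secret/public'], [:disallow, '/secret']]
puts allowed_by_rules?(rules, '/secret/public/page.html')
puts allowed_by_rules?(rules, '/secret/plans.html')
```

Because order decides the result, swapping the two rules in this example would make /secret/public/page.html disallowed.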

== TODO

I would like to add support for the Crawl-delay directive, and indeed any other parameters in use.

== Requirements

Ruby >= 1.8.7
iconv, net/http and uri

== Installation

This library is intended to be installed via the RubyGems[http://rubyforge.org/projects/rubygems/] system.

$ gem install robotstxt

You might need administrator privileges on your system to install it.

== Author

Author:: {Conrad Irwin} conrad@rapportive.com
Author:: {Simone Rinzivillo}[http://www.simonerinzivillo.it/] srinzivillo@gmail.com

== License


Package last updated on 18 Sep 2014
