Research
Security News
Malicious npm Package Targets Solana Developers and Hijacks Funds
A malicious npm package targets Solana developers, rerouting funds in 2% of transactions to a hardcoded address.
= Robotstxt
Robotstxt is an Ruby robots.txt file parser.
The robots.txt exclusion protocol is a simple mechanism whereby site-owners can guide any automated crawlers to relevant parts of their site, and prevent them accessing content which is intended only for other eyes. For more information, see http://www.robotstxt.org/.
This library provides mechanisms for obtaining and parsing the robots.txt file from websites. As there is no official "standard" it tries to do something sensible, though inspiration was taken from:
While the parsing semantics of this library are explained below, you should not write sitemaps that depend on all robots acting the same -- they simply won't. Even the various different ruby libraries support very different subsets of functionality.
This gem builds on the work of Simone Rinzivillo, and is released under the MIT license -- see the LICENSE file.
== Usage
There are two public points of interest, firstly the Robotstxt module, and secondly the Robotstxt::Parser class.
The Robotstxt module has three public methods:
Robotstxt.get source, user_agent, (options) Returns a Robotstxt::Parser for the robots.txt obtained from source.
Robotstxt.parse robots_txt, user_agent Returns a Robotstxt::Parser for the robots.txt passed in
Robotstxt.get_allowed? urlish, user_agent, (options) Returns true iff the robots.txt obtained from the host identified by the urlish allows the given user agent access to the url.
The Robotstxt::Parser class contains two pieces of state, the user_agent and the text of the robots.txt. In addition its instances have two public methods:
Robotstxt::Parser#allowed? urlish Returns true iff the robots.txt file allows this user_agent access to that url.
Robotstxt::Parser#sitemaps Returns a list of the sitemaps listed in the robots.txt file.
In the above there are five kinds of parameter,
A "urlish" is either a String that represents a URL (suitable for passing to
URI.parse) or a URI object, i.e.
urlish = "http://www.example.com/"
urlish = "/index.html"
urlish = https://compicat.ed/home?action=fire#joking"
urlish = URI.parse("http://example.co.uk")
A "source" is either a "urlish", or a Net::HTTP connection. This allows the
library to re-use the same connection when the server respects Keep-alive:
headers, i.e.
source = Net::HTTP.new("example.com", 80)
Net::HTTP.start("example.co.uk", 80) do |http|
source = http
end
source = "http://www.example.com/index.html"
When a "urlish" is provided, only the host and port sections are used, and
the path is forced to "/robots.txt".
A "robots_txt" is the textual content of a robots.txt file that is in the
same encoding as the urls you will be fetching (normally utf8).
A "user_agent" is the string value you use in your User-agent: header.
The "options" is an optional hash containing
:num_redirects (5) - the number of redirects to follow before giving up.
:http_timeout (10) - the length of time in seconds to wait for one http
request
:url_charset (utf8) - the charset which you will use to encode your urls.
I recommend not passing the options unless you have to.
== Examples
url = "http://example.com/index.html" if Robotstxt.get_allowed?(url, "Crawler") open(url) end
Net::HTTP.start("example.co.uk") do |http| robots = Robotstxt.get(http, "Crawler")
if robots.allowed? "/index.html"
http.get("/index.html")
elsif robots.allowed? "/index.php"
http.get("/index.php")
end
end
== Details
=== Request level
This library handles different HTTP status codes according to the specifications on robotstxt.org, in particular:
If an HTTPUnauthorized or an HTTPForbidden is returned when trying to access /robots.txt, then the entire site should be considered "Disallowed".
If an HTTPRedirection is returned, it should be followed (though we give up after five redirects, to avoid infinite loops).
If an HTTPSuccess is returned, the body is converted into utf8, and then parsed.
Any other response, or no response, indicates that there are no Disallowed urls no the site.
=== User-agent matching
This is case-insensitive, substring matching, i.e. equivalent to matching the user agent with /.thing./i.
Additionally, * characters are interpreted as meaning any number of any character (in regular expression idiom: /.*/). Google implies that it does this, at least for trailing s, and the standard implies that "" is a special user agent meaning "everything not referred to so far".
There can be multiple User-agent: lines for each section of Allow: and Disallow: lines in the robots.txt file:
User-agent: Google User-agent: Bing Disallow: /secret
In cases like this, all user-agents inherit the same set of rules.
=== Path matching
This is case-sensitive prefix matching, i.e. equivalent to matching the requested path (or path + '?' + query) against /^thing.*/. As with user-agents,
Additionally, when the pattern ends with a $, it forces the pattern to match the entire path (or path + ? + query).
In order to get consistent results, before the globs are matched, the %-encoding is normalised so that only /?&= remain %-encoded. For example, /h%65llo/ is the same as /hello/, but /ac%2fdc is not the same as /ac/dc - this is due to the significance granted to the / operator in urls.
The paths of the first section that matched our user-agent (by order of appearance in the file) are parsed in order of appearance. The first Allow: or Disallow: rule that matches the url is accepted. This is prescribed by robotstxt.org, but other parsers take wildly different strategies: Google checks all Allows: then all Disallows: Bing checks the most-specific first Others check all Disallows: then all Allows
As is conventional, a "Disallow: " line with no path given is treated as "Allow: *", and if a URL didn't match any path specifiers (or the user-agent didn't match any user-agent sections) then that is implicit permission to crawl.
== TODO
I would like to add support for the Crawl-delay directive, and indeed any other parameters in use.
== Requirements
== Installation
This library is intended to be installed via the RubyGems[http://rubyforge.org/projects/rubygems/] system.
$ gem install robotstxt
You might need administrator privileges on your system to install it.
== Author
Author:: {Conrad Irwin} conrad@rapportive.com Author:: {Simone Rinzivillo}[http://www.simonerinzivillo.it/] srinzivillo@gmail.com
== License
Robotstxt is released under the MIT license. Copyright (c) 2010 Conrad Irwin Copyright (c) 2009 Simone Rinzivillo
FAQs
Unknown package
We found that robotstxt-parser demonstrated a not healthy version release cadence and project activity because the last version was released a year ago. It has 1 open source maintainer collaborating on the project.
Did you know?
Socket for GitHub automatically highlights issues in each pull request and monitors the health of all your open source dependencies. Discover the contents of your packages and block harmful activity before you install or update your dependencies.
Research
Security News
A malicious npm package targets Solana developers, rerouting funds in 2% of transactions to a hardcoded address.
Security News
Research
Socket researchers have discovered malicious npm packages targeting crypto developers, stealing credentials and wallet data using spyware delivered through typosquats of popular cryptographic libraries.
Security News
Socket's package search now displays weekly downloads for npm packages, helping developers quickly assess popularity and make more informed decisions.