
github.com/benjaminestes/robots/v2

Install

    go get github.com/benjaminestes/robots/v2

Package robots implements robots.txt parsing and matching based on Google's specification. For a robots.txt primer, please read the full specification at: https://developers.google.com/search/reference/robots_txt.

Clients of this package have one obligation: when testing whether a URL can be crawled, use the correct robots.txt file. The specification uses scheme, port, and punycode variations to define which URLs are in scope. To get the right robots.txt file, use Locate. Locate takes as its only argument the URL you want to access, and returns the URL of the robots.txt file that governs access. Locate will always return a single unique robots.txt URL for all input URLs sharing a scope.

In practice, a client pattern for testing whether a URL is accessible would be:

a) Locate the robots.txt file for the URL;
b) check whether you have fetched data for that robots.txt file;
c) if yes, use the data to Test the URL against your user agent;
d) if no, fetch the robots.txt data and try again.

The sketches below illustrate this loop. For details, see "File location & range of validity" in the specification: https://developers.google.com/search/reference/robots_txt#file-location--range-of-validity.

A generous parser is specified. A valid line is accepted, and an invalid line is silently discarded. This is true even if the content parsed is in an unexpected format, like HTML. For details, see "File format" in the specification: https://developers.google.com/search/reference/robots_txt#file-format.

The specification states that a crawler will assume all URLs are accessible, even if there is no robots.txt file or the body of the robots.txt file is empty. So a robots.txt file served with a 404 status code results in all URLs being crawlable. The exception is a 5xx status code, which is treated as a temporary "full disallow" of crawling. For details, see "Handling HTTP result codes" in the specification: https://developers.google.com/search/reference/robots_txt#handling-http-result-codes.
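To get a feel for scope resolution, here is a minimal sketch of Locate. The (string, error) return shape is assumed from the description above; confirm it against the package's godoc before relying on it.

    package main

    import (
        "fmt"
        "log"

        "github.com/benjaminestes/robots/v2"
    )

    func main() {
        // Locate maps any URL to the single robots.txt URL that
        // governs its scope; scheme, port, and punycode variations
        // are taken into account.
        robotsURL, err := robots.Locate("https://www.example.com:443/products/page.html")
        if err != nil {
            log.Fatal(err)
        }
        // The default https port shares a scope with no explicit
        // port, so this is expected to print:
        // https://www.example.com/robots.txt
        fmt.Println(robotsURL)
    }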

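The four-step pattern above translates directly into code. The sketch below caches parsed robots.txt data keyed by robots.txt URL, so each file is fetched at most once. The From constructor (taking the HTTP status code and body), the *robots.Robots type, and the Test(agent, url) signature are assumptions drawn from this description; check the godoc for the exact API. The agent name "Crawlerbot" is arbitrary.

    package main

    import (
        "fmt"
        "log"
        "net/http"

        "github.com/benjaminestes/robots/v2"
    )

    // crawlable reports whether agent may fetch rawurl, consulting
    // (and filling) a cache of parsed robots.txt data.
    func crawlable(cache map[string]*robots.Robots, agent, rawurl string) (bool, error) {
        // (a) Locate the robots.txt file that governs rawurl.
        robotsURL, err := robots.Locate(rawurl)
        if err != nil {
            return false, err
        }
        // (b) Check whether we have already fetched that file.
        r, ok := cache[robotsURL]
        if !ok {
            // (d) Not yet fetched: get and parse it, then proceed.
            resp, err := http.Get(robotsURL)
            if err != nil {
                return false, err
            }
            defer resp.Body.Close()
            // Passing the status code lets the parser apply the
            // spec's result-code rules (404 vs. 5xx, see below).
            r, err = robots.From(resp.StatusCode, resp.Body)
            if err != nil {
                return false, err
            }
            cache[robotsURL] = r
        }
        // (c) Test the URL against our user agent.
        return r.Test(agent, rawurl), nil
    }

    func main() {
        cache := map[string]*robots.Robots{}
        ok, err := crawlable(cache, "Crawlerbot", "https://www.example.com/products/page.html")
        if err != nil {
            log.Fatal(err)
        }
        fmt.Println(ok)
    }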

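The result-code rules can also be exercised directly: parsing under a 404 should yield a matcher that allows everything, while a 5xx should temporarily disallow everything. Again, From is an assumed name for the parser entry point, so treat this as illustrative.

    package main

    import (
        "fmt"
        "strings"

        "github.com/benjaminestes/robots/v2"
    )

    func main() {
        // 404: treated like a missing robots.txt, so all URLs are
        // assumed accessible.
        notFound, err := robots.From(404, strings.NewReader(""))
        if err == nil {
            fmt.Println(notFound.Test("Crawlerbot", "https://www.example.com/a")) // expect: true
        }

        // 5xx: treated as a temporary "full disallow" of crawling.
        unavailable, err := robots.From(503, strings.NewReader(""))
        if err == nil {
            fmt.Println(unavailable.Test("Crawlerbot", "https://www.example.com/a")) // expect: false
        }
    }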

robots

Package robots implements robots.txt file parsing based on Google's specification:

https://developers.google.com/search/reference/robots_txt

Documentation

For installation, usage and description, please see the documentation:

https://godoc.org/github.com/benjaminestes/robots

License

MIT

