robotstxt
Package robotstxt implements the Robots Exclusion Protocol, https://en.wikipedia.org/wiki/Robots_exclusion_standard, with a simple API.
Link to the GoDocs -> here.
Basic Examples
1. Creating a robotsTxt with a URL
This is the most common way to use this package.
package main

import (
    "fmt"

    "github.com/itmayziii/robotstxt"
)

func main() {
    // Fetch and parse robots.txt in the background; the result arrives on ch.
    ch := make(chan robotstxt.ProtocolResult)
    go robotstxt.NewFromURL("https://www.dumpsters.com", ch)
    robotsTxt := <-ch
    fmt.Println(robotsTxt.Error)

    canCrawl, err := robotsTxt.Protocol.CanCrawl("googlebot", "/bdso/pages")
    fmt.Println(canCrawl)
    fmt.Println(err)
}
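If you would rather fail fast, here is a minimal variation of the example above that checks the Error field before using the parsed protocol. It relies only on the ProtocolResult fields already shown in this README (Error and Protocol).
package main

import (
    "fmt"
    "log"

    "github.com/itmayziii/robotstxt"
)

func main() {
    ch := make(chan robotstxt.ProtocolResult)
    go robotstxt.NewFromURL("https://www.dumpsters.com", ch)

    // Block until the fetch and parse finish.
    result := <-ch
    if result.Error != nil {
        log.Fatalf("could not load robots.txt: %v", result.Error)
    }

    canCrawl, err := result.Protocol.CanCrawl("googlebot", "/bdso/pages")
    if err != nil {
        log.Fatalf("could not evaluate path: %v", err)
    }
    fmt.Println(canCrawl)
}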
2. Creating a robotsTxt Manually
You will likely not use this method as often, since you would need to fetch the robots.txt file from the server yourself.
package main

import (
    "fmt"

    "github.com/itmayziii/robotstxt"
)

func main() {
    robotsTxt, _ := robotstxt.New("", `
# Robots.txt test file
# 06/04/2018
# Indented comments are allowed
User-agent : *
Crawl-delay: 5
Disallow: /cms/
Disallow: /pricing/frontend
Disallow: /pricing/admin/ # SPA application built into the site
Disallow : *?s=lightbox
Disallow: /se/en$
Disallow:*/retail/*/frontend/*
Allow: /be/fr_fr/retail/fr/
# Multiple groups with all access
User-agent: AdsBot-Google
User-agent: AdsBot-Bing
Allow: /
# Multiple sitemaps
Sitemap: https://www.dumpsters.com/sitemap.xml
Sitemap: https://www.dumpsters.com/sitemap-launch-index.xml
`)

    canCrawl, err := robotsTxt.CanCrawl("googlebot", "/cms/pages")
    fmt.Println(canCrawl)
    fmt.Println(err)
}
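Given the rules above, googlebot falls under the * group and Disallow: /cms/ matches /cms/pages, so this example should print false followed by a nil error.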
Specification
A large portion of how this package handles the specification comes from https://developers.google.com/search/reference/robots_txt.
In fact, this package tests against all of the examples listed at
https://developers.google.com/search/reference/robots_txt#url-matching-based-on-path-values, plus many more.
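For instance, here is a minimal sketch of the wildcard (*) and end-of-URL ($) matching described on that page, using this package's New and CanCrawl. The robots.txt rules are taken from Google's reference examples; the expected results in the comments assume the package follows those matching rules, as stated above.
package main

import (
    "fmt"

    "github.com/itmayziii/robotstxt"
)

func main() {
    robotsTxt, err := robotstxt.New("https://www.dumpsters.com", `
User-agent: *
Disallow: /fish*
Disallow: /*.php$
`)
    if err != nil {
        fmt.Println(err)
        return
    }

    // "/fish*" matches anything whose path starts with "/fish", so this
    // should report false (not crawlable).
    fmt.Println(robotsTxt.CanCrawl("googlebot", "/fish.html"))

    // "/*.php$" only matches paths that end in ".php", so the first call
    // should report false and the second should report true.
    fmt.Println(robotsTxt.CanCrawl("googlebot", "/filename.php"))
    fmt.Println(robotsTxt.CanCrawl("googlebot", "/filename.php?parameters"))
}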
Important Notes From the Spec
- User agents are case insensitive, so "googlebot" and "Googlebot" are the same thing.
- Directive "Allow" and "Disallow" values are case sensitive, so "/pricing" and "/Pricing" are not the same thing.
- The entire file must be valid UTF-8 encoded; this package will return an error if that is not the case.
- The most specific user agent wins.
- Allow and disallow directives also respect the one that is most specific, based on length, and in the event of a tie the allow directive wins, i.e. disallow: /cms/ loses to allow: /cms/ and to allow: /cms*, but not to allow: /cms (see the example after this list).
- Directives listed in the robots.txt file apply only to a specific host, protocol, and port number,
https://developers.google.com/search/reference/robots_txt#file-location--range-of-validity. This package validates the host, protocol,
and port number every time it is asked whether a robot "CanCrawl" a path and that path itself contains a host, protocol, and port.
robotsTxt, _ := robotstxt.New("https://www.dumpsters.com", `
User-agent: *
Disallow: /wiki/
`)
robotsTxt.CanCrawl("googlebot", "/products/")                          // relative path, always evaluated
robotsTxt.CanCrawl("googlebot", "https://www.dumpsters.com/products/") // same host, protocol, and port
robotsTxt.CanCrawl("googlebot", "http://www.dumpsters.com/products/")  // different protocol than the robots.txt URL
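To make the precedence rules above concrete, here is a minimal sketch. The robots.txt content and the googlebot-news group are made up for illustration; the expected results in the comments follow from the notes above (most specific user agent wins, and an allow/disallow tie goes to allow).
package main

import (
    "fmt"

    "github.com/itmayziii/robotstxt"
)

func main() {
    robotsTxt, err := robotstxt.New("https://www.dumpsters.com", `
User-agent: *
Disallow: /

User-agent: googlebot-news
Allow: /cms*
Disallow: /cms/
`)
    if err != nil {
        fmt.Println(err)
        return
    }

    // "googlebot-news" matches its own group rather than "*". "allow: /cms*"
    // and "disallow: /cms/" are equally specific, so the tie should go to the
    // allow directive and this should report true.
    fmt.Println(robotsTxt.CanCrawl("googlebot-news", "/cms/pages"))

    // "googlebot" only matches the "*" group, which disallows everything,
    // so this should report false.
    fmt.Println(robotsTxt.CanCrawl("googlebot", "/cms/pages"))
}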
Roadmap
- Respect a "noindex" meta tag and HTTP response header as described here. There are a couple of considerations to take into account before implementing this (a rough, hypothetical sketch of detecting "noindex" follows this list):
  - We need to leave the current CanCrawl method as is, since it is meant to determine whether or not a robot can crawl a page prior to actually loading it. The "noindex" meta tag and HTTP response header, by the nature of where they are located, are only available after the crawler has loaded the page.
  - Maybe two methods would be needed to implement this. One method would retrieve the response for the user and hand back an instance of RobotsExclusionProtocol as well as the response itself, something like CanCrawlPage, which of course would also go through the robots.txt logic before even requesting the page. A second method would take a response the user has already retrieved and run it through the same logic as the first. I'm not 100% sure we would need both methods, but I can see why some people would want to retrieve the HTTP response themselves.
- Potentially support the Host directive as described here.
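Here is a rough sketch of what the "noindex" detection from the first roadmap item might look like. Nothing below exists in this package today: the hasNoindex helper, its signature, and its behaviour are hypothetical, and it leans on the golang.org/x/net/html parser for the meta tag check.
package main

import (
    "fmt"
    "net/http"
    "strings"

    "golang.org/x/net/html"
)

// hasNoindex is a hypothetical helper: it reports whether a fetched page opts
// out of indexing via an X-Robots-Tag header or a robots meta tag.
func hasNoindex(resp *http.Response) (bool, error) {
    // "X-Robots-Tag: noindex" can appear on any content type.
    for _, v := range resp.Header.Values("X-Robots-Tag") {
        if strings.Contains(strings.ToLower(v), "noindex") {
            return true, nil
        }
    }

    // <meta name="robots" content="noindex"> only appears in HTML bodies.
    doc, err := html.Parse(resp.Body)
    if err != nil {
        return false, err
    }
    found := false
    var walk func(n *html.Node)
    walk = func(n *html.Node) {
        if n.Type == html.ElementNode && n.Data == "meta" {
            var name, content string
            for _, a := range n.Attr {
                switch strings.ToLower(a.Key) {
                case "name":
                    name = strings.ToLower(a.Val)
                case "content":
                    content = strings.ToLower(a.Val)
                }
            }
            if name == "robots" && strings.Contains(content, "noindex") {
                found = true
            }
        }
        for c := n.FirstChild; c != nil; c = c.NextSibling {
            walk(c)
        }
    }
    walk(doc)
    return found, nil
}

func main() {
    resp, err := http.Get("https://www.dumpsters.com")
    if err != nil {
        fmt.Println(err)
        return
    }
    defer resp.Body.Close()

    fmt.Println(hasNoindex(resp))
}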