robotstxt
Package robotstxt implements the Robots Exclusion Protocol, https://en.wikipedia.org/wiki/Robots_exclusion_standard, with a simple API.
Full API documentation is available on GoDoc.
Basic Examples
1. Creating a robotsTxt with a URL
This is the most common way to use this package.
package main

import (
    "fmt"

    "github.com/itmayziii/robotstxt"
)

func main() {
    // NewFromURL fetches and parses the robots.txt file for https://www.dumpsters.com,
    // sending a ProtocolResult (the parsed Protocol plus any Error) on the channel.
    ch := make(chan robotstxt.ProtocolResult)
    go robotstxt.NewFromURL("https://www.dumpsters.com", ch)
    robotsTxt := <-ch
    fmt.Println(robotsTxt.Error)

    canCrawl, err := robotsTxt.Protocol.CanCrawl("googlebot", "/bdso/pages")
    fmt.Println(canCrawl)
    fmt.Println(err)
}
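NewFromURL fetches the robots.txt file over the network, which is why it is started in its own goroutine and reports back over the channel. The ProtocolResult it sends contains both the parsed Protocol and any Error encountered while fetching or parsing, so check Error before using Protocol.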
2. Creating a robotsTxt Manually
You will likely not use this method, as you would need to retrieve the robots.txt file from the server yourself.
package main

import (
    "fmt"

    "github.com/itmayziii/robotstxt"
)

func main() {
    // The parse error is ignored here for brevity.
    robotsTxt, _ := robotstxt.New("", `
# Robots.txt test file
# 06/04/2018
    # Indented comments are allowed
User-agent : *
Crawl-delay: 5
Disallow: /cms/
Disallow: /pricing/frontend
Disallow: /pricing/admin/ # SPA application built into the site
Disallow : *?s=lightbox
Disallow: /se/en$
Disallow:*/retail/*/frontend/*
Allow: /be/fr_fr/retail/fr/

# Multiple groups with all access
User-agent: AdsBot-Google
User-agent: AdsBot-Bing
Allow: /

# Multiple sitemaps
Sitemap: https://www.dumpsters.com/sitemap.xml
Sitemap: https://www.dumpsters.com/sitemap-launch-index.xml
`)

    canCrawl, err := robotsTxt.CanCrawl("googlebot", "/cms/pages")
    fmt.Println(canCrawl)
    fmt.Println(err)
}
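The file above demonstrates several things the parser accepts: indented and trailing comments, flexible whitespace around the ":" separator, "*" and "$" pattern matching, multiple user agents sharing a single group, a Crawl-delay, and multiple Sitemap entries.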
Specification
Much of how this package handles the specification comes from https://developers.google.com/search/reference/robots_txt.
In fact, this package is tested against all of the examples listed at
https://developers.google.com/search/reference/robots_txt#url-matching-based-on-path-values, plus many more.
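As a quick illustration of those matching rules, here is a minimal sketch using only the New and CanCrawl calls shown above. It is not part of the package's own documentation; the robots.txt contents are made up, and the expected results in the comments follow Google's published matching examples, which this package is tested against.

package main

import (
    "fmt"

    "github.com/itmayziii/robotstxt"
)

func main() {
    robotsTxt, err := robotstxt.New("", `
User-agent: *
Disallow: /fish*
Disallow: /*.php$
`)
    if err != nil {
        fmt.Println(err)
        return
    }

    // "/fish*" matches any path that begins with "/fish".
    canCrawl, _ := robotsTxt.CanCrawl("googlebot", "/fishheads/yummy.html")
    fmt.Println(canCrawl) // expected: false

    // "/*.php$" only matches paths that end in ".php", so the query string keeps this one crawlable.
    canCrawl, _ = robotsTxt.CanCrawl("googlebot", "/filename.php?parameters")
    fmt.Println(canCrawl) // expected: true

    canCrawl, _ = robotsTxt.CanCrawl("googlebot", "/folder/filename.php")
    fmt.Println(canCrawl) // expected: false
}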
Important Notes From the Spec
- User agents are case insensitive, so "googlebot" and "Googlebot" are the same thing.
- "Allow" and "Disallow" directive values are case sensitive, so "/pricing" and "/Pricing" are not the same thing.
- The entire file must be valid UTF-8 encoded; this package returns an error if it is not.
- The most specific user agent wins.
- Allow and disallow directives also respect whichever one is most specific, and in the event of a tie the allow directive wins. A short sketch illustrating these precedence rules follows the example below.
- Directives listed in the robots.txt file apply only to a host, protocol, and port number,
  https://developers.google.com/search/reference/robots_txt#file-location--range-of-validity. This package validates the host, protocol,
  and port number every time it is asked if a robot "CanCrawl" a path and the path contains the host, protocol, and port. For example:
robotsTxt, _ := robotstxt.New("https://www.dumpsters.com", `
User-agent: *
Disallow: /wiki/
`)

// A relative path is evaluated against the rules above as usual.
robotsTxt.CanCrawl("googlebot", "/products/")
// An absolute URL whose protocol, host, and port match the robots.txt origin is also accepted.
robotsTxt.CanCrawl("googlebot", "https://www.dumpsters.com/products/")
// The protocol here ("http") does not match the origin the robots.txt was created with ("https"),
// so this call fails the validation described above.
robotsTxt.CanCrawl("googlebot", "http://www.dumpsters.com/products/")