github.com/itmayziii/robotstxt
Package robotstxt implements the Robots Exclusion Protocol, https://en.wikipedia.org/wiki/Robots_exclusion_standard, with a simple API.
GoDocs: https://pkg.go.dev/github.com/itmayziii/robotstxt
This is the most common way to use this package.
package main

import (
    "fmt"

    "github.com/itmayziii/robotstxt"
)

func main() {
    // NewFromURL retrieves and parses the robots.txt file for the given URL,
    // sending the result on the channel.
    ch := make(chan robotstxt.ProtocolResult)
    go robotstxt.NewFromURL("https://www.dumpsters.com", ch)
    robotsTxt := <-ch
    fmt.Println(robotsTxt.Error)

    canCrawl, err := robotsTxt.Protocol.CanCrawl("googlebot", "/bdso/pages")
    fmt.Println(canCrawl)
    fmt.Println(err)
    // Output:
    // <nil>
    // false
    // <nil>
}
You will likely not use this method as often, since you would need to retrieve the robots.txt file from the server yourself.
package main

import (
    "fmt"

    "github.com/itmayziii/robotstxt"
)

func main() {
    robotsTxt, _ := robotstxt.New("", `
# Robots.txt test file
# 06/04/2018
# Indented comments are allowed
User-agent : *
Crawl-delay: 5
Disallow: /cms/
Disallow: /pricing/frontend
Disallow: /pricing/admin/ # SPA application built into the site
Disallow : *?s=lightbox
Disallow: /se/en$
Disallow:*/retail/*/frontend/*
Allow: /be/fr_fr/retail/fr/
# Multiple groups with all access
User-agent: AdsBot-Google
User-agent: AdsBot-Bing
Allow: /
# Multiple sitemaps
Sitemap: https://www.dumpsters.com/sitemap.xml
Sitemap: https://www.dumpsters.com/sitemap-launch-index.xml
`)

    canCrawl, err := robotsTxt.CanCrawl("googlebot", "/cms/pages")
    fmt.Println(canCrawl)
    fmt.Println(err)
    // Output:
    // false
    // <nil>
}
A large portion of how this package handles the specification comes from https://developers.google.com/search/reference/robots_txt. In fact, this package is tested against all of the examples listed at https://developers.google.com/search/reference/robots_txt#url-matching-based-on-path-values, plus many more.
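Below is a minimal sketch of that URL matching through this package's CanCrawl method. The expected results in the comments are assumptions based on Google's published examples for the "*" wildcard and the "$" end-of-path anchor, not output captured from this package.

package main

import (
    "fmt"

    "github.com/itmayziii/robotstxt"
)

func main() {
    // "*" matches any sequence of characters and "$" anchors the end of the
    // path, per the Google examples referenced above.
    robotsTxt, err := robotstxt.New("", `
User-agent: *
Disallow: /*.php$
`)
    if err != nil {
        fmt.Println(err)
        return
    }

    canCrawl, err := robotsTxt.CanCrawl("googlebot", "/filename.php")
    fmt.Println(canCrawl, err) // expected: false <nil>

    canCrawl, err = robotsTxt.CanCrawl("googlebot", "/filename.php5")
    fmt.Println(canCrawl, err) // expected: true <nil>
}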
Important Notes From the Spec
User Agents are case insensitive so "googlebot" and "Googlebot" are the same thing.
Directive "Allow" and "Disallow" values are case sensitive so "/pricing" and "/Pricing" are not the same thing.
The entire file must be valid UTF-8; this package will return an error if that is not the case.
The most specific user agent wins.
Allow and disallow directives also follow the most specific match, and in the event of a tie the allow directive wins (see the sketch after the example below).
Directives listed in the robots.txt file apply only to a host, protocol, and port number, https://developers.google.com/search/reference/robots_txt#file-location--range-of-validity. This package validates the host, protocol, and port number every time it is asked if a robot "CanCrawl" a path and the path contains the host, protocol, and port.
robotsTxt, _ := robotstxt.New("https://www.dumpsters.com", `
User-agent: *
Disallow: /wiki/
`)

robotsTxt.CanCrawl("googlebot", "/products/")                          // true
robotsTxt.CanCrawl("googlebot", "https://www.dumpsters.com/products/") // true
robotsTxt.CanCrawl("googlebot", "http://www.dumpsters.com/products/")  // false - the URL does not match the URL provided when "robotsTxt" was created