# SiteMapper

Parallel web crawler implemented in Go for producing site maps.
## Installation

```
go get -u github.com/Matt-Esch/sitemapper
```
## Quick Start

You can use the package to read a site map from a given URL, or you can
compile and use the provided binary.

### Basic Usage
```go
package main

import (
	"log"
	"os"

	"github.com/Matt-Esch/sitemapper"
)

func main() {
	// Crawl the domain, collecting same-domain links into a site map.
	siteMap, err := sitemapper.CrawlDomain("https://monzo.com")
	if err != nil {
		log.Fatalf("Error: %s", err)
	}

	// Write the collected URLs to stdout.
	siteMap.WriteMap(os.Stdout)
}
```
### Binary usage

The package provides a binary to run the crawler from the command line:

```
go install github.com/Matt-Esch/sitemapper/cmd/sitemapper
sitemapper -u "http://todomvc.com"
```
```
http://todomvc.com
http://todomvc.com/
http://todomvc.com/examples/angular-dart/web
http://todomvc.com/examples/angular-dart/web/
http://todomvc.com/examples/angular2
http://todomvc.com/examples/angular2/
http://todomvc.com/examples/angularjs
http://todomvc.com/examples/angularjs/
http://todomvc.com/examples/angularjs_require
http://todomvc.com/examples/angularjs_require/
...
```
For a list of options, use `sitemapper -h`:

```
-c int
      maximum concurrency (default 8)
-d    enable debug logs
-k duration
      http keep alive timeout (default 30s)
-t duration
      http request timeout (default 30s)
-u string
      url to crawl (required)
-v    enable verbose logging
-w duration
      maximum crawl time
```
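The `-t` and `-k` durations map onto standard Go HTTP client settings. As a
rough sketch of what such flags typically configure (this is an assumption
about the wiring, not the crawler's actual code; `newClient` is a
hypothetical helper):

```go
package main

import (
	"net"
	"net/http"
	"time"
)

// newClient is a hypothetical helper showing how a request timeout (-t)
// and a keep-alive period (-k) are commonly applied to an http.Client.
// The crawler's real wiring may differ.
func newClient(timeout, keepAlive time.Duration) *http.Client {
	return &http.Client{
		Timeout: timeout, // overall per-request timeout (-t)
		Transport: &http.Transport{
			DialContext: (&net.Dialer{
				Timeout:   timeout,
				KeepAlive: keepAlive, // TCP keep-alive period (-k)
			}).DialContext,
		},
	}
}

func main() {
	client := newClient(30*time.Second, 30*time.Second)
	_ = client // use the client to make crawl requests
}
```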
## Brief implementation outline

- The bulk of the implementation is found in `./sitemapper.go`.
- Tests and benchmarks are defined in `./sitemapper_test.go`.
- A test server is defined in `./test/server` and is used to create a
  crawlable website that listens on localhost on a random port. This website
  adds various traps, such as pointing to external domains, in order to test
  the crawler.
- The binary to run the web crawler from the command line is defined under
  `./cmds/sitemapper/main.go`.
## Design choices and limitations

- The web crawler is a parallel crawler with bounded concurrency. A channel
  of URLs is consumed by a fixed number of goroutines. These goroutines make
  an HTTP GET request to the received URL, parse the response for `<a>` tags,
  and push previously unseen URLs onto the URL channel for further
  consumption (see the worker-pool sketch after this list).
- The web crawler populates the site map with new URLs before making a
  request to them. This means that non-existent pages (404s) and non-web-page
  links (e.g. links to PDFs) will appear in the site map.
- By default, the logic for checking "same domain" considers only the "host"
  portion of the URL. The scheme (http/https) is ignored when checking the
  same-domain constraint, even though a scheme mismatch would be considered
  cross-origin. A universally acceptable definition of "same domain" is hard
  to pin down; some implementations resort to DNS lookups as the most
  accurate check. For that reason, a sensible default is provided, but it can
  be overridden by the caller (a hypothetical predicate sketch also follows
  this list).
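For illustration, the bounded-concurrency pattern described in the first
bullet can be sketched as follows. This is a minimal stand-alone example,
not the package's actual implementation; `fetchLinks` is a hypothetical
stand-in for the GET-and-parse step:

```go
package main

import (
	"fmt"
	"sync"
)

// fetchLinks is a hypothetical stand-in for the real work: make an
// HTTP GET request to the URL and parse the response for <a> tags.
func fetchLinks(url string) []string {
	return nil
}

// crawl visits start and every previously unseen URL discovered from
// it, using a fixed number of worker goroutines.
func crawl(start string, workers int) []string {
	var (
		mu      sync.Mutex
		seen    = map[string]bool{start: true}
		siteMap []string
		pending sync.WaitGroup // counts URLs queued but not yet processed
	)

	// Buffered queue of URLs to visit. A real implementation needs an
	// unbounded queue (or a dispatcher goroutine) so workers can never
	// deadlock pushing to a full channel.
	urls := make(chan string, 1024)

	pending.Add(1)
	urls <- start

	for i := 0; i < workers; i++ {
		go func() {
			for u := range urls {
				mu.Lock()
				siteMap = append(siteMap, u)
				mu.Unlock()

				for _, link := range fetchLinks(u) {
					mu.Lock()
					unseen := !seen[link]
					seen[link] = true
					mu.Unlock()
					if unseen {
						pending.Add(1)
						urls <- link
					}
				}
				pending.Done()
			}
		}()
	}

	pending.Wait() // every queued URL has been fetched and parsed
	close(urls)    // release the workers
	return siteMap
}

func main() {
	fmt.Println(crawl("https://example.com", 8))
}
```

Note that, as in the second bullet, a URL enters `seen` (and hence the
output) as soon as it is discovered, before it is ever fetched.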
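And as a hypothetical illustration of the same-domain point: the default
policy compares only the host, while a stricter caller-supplied check could
also compare schemes. The predicate names here are illustrative, not the
package's actual override API:

```go
package main

import (
	"fmt"
	"net/url"
)

// sameHost mirrors the default policy described above: two URLs are
// "same domain" when their Host components match; scheme is ignored.
func sameHost(a, b *url.URL) bool {
	return a.Host == b.Host
}

// sameOrigin is a stricter, hypothetical alternative a caller might
// supply, treating an http/https mismatch as cross-origin.
func sameOrigin(a, b *url.URL) bool {
	return a.Scheme == b.Scheme && a.Host == b.Host
}

func main() {
	u1, _ := url.Parse("http://monzo.com/about")
	u2, _ := url.Parse("https://monzo.com/")
	fmt.Println(sameHost(u1, u2))   // true: hosts match, scheme ignored
	fmt.Println(sameOrigin(u1, u2)) // false: schemes differ
}
```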
## License

Released under the MIT License.