
github.com/bertgit/go-crawler


v0.0.0-20180319234250-f2f80fe865c9

Web Crawler in Go

Design

A pool of workers pops tasks from a URL frontier channel. Each worker fetches the HTML body, extracts the links, and sends them back on a response channel. The crawler runs these links through a URL cache to filter out URLs it has already seen, then adds the remaining links to the URL frontier.
The output is a file in DOT format (sitemap.dot), which can be used to produce graphs similar to sitemap-reduced.svg.
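
The design above can be sketched roughly as follows. This is a minimal illustration, not the package's actual code: the names crawl, fetchFunc and result are hypothetical, and the fetch step is replaced by an in-memory lookup so the example runs without network access.

```go
package main

import (
	"fmt"
	"sync"
)

// fetchFunc is a hypothetical stand-in for "fetch the HTML body and
// extract its links". It returns the links found on the given page.
type fetchFunc func(url string) []string

// result carries the links one worker extracted from one page.
type result struct {
	from  string
	links []string
}

// crawl starts a pool of workers that read from a frontier channel and
// report on a response channel. It returns page -> outgoing links.
func crawl(start string, workers int, fetch fetchFunc) map[string][]string {
	frontier := make(chan string)
	responses := make(chan result)

	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for u := range frontier {
				responses <- result{from: u, links: fetch(u)}
			}
		}()
	}

	seen := map[string]bool{start: true} // the URL cache
	edges := make(map[string][]string)
	queue := []string{start} // URLs waiting for a free worker
	pending := 0             // URLs handed out but not yet answered

	for len(queue) > 0 || pending > 0 {
		// A nil channel never becomes ready in a select, so the send
		// case is only active while something is queued.
		var send chan string
		var next string
		if len(queue) > 0 {
			send, next = frontier, queue[0]
		}
		select {
		case send <- next:
			queue = queue[1:]
			pending++
		case r := <-responses:
			pending--
			edges[r.from] = r.links
			for _, l := range r.links {
				if !seen[l] { // filter links through the cache
					seen[l] = true
					queue = append(queue, l)
				}
			}
		}
	}
	close(frontier)
	wg.Wait()
	return edges
}

func main() {
	// A tiny fake site, used instead of real HTTP requests.
	site := map[string][]string{
		"/":      {"/about", "/blog"},
		"/about": {"/"},
		"/blog":  {"/", "/about", "/blog/post"},
	}
	edges := crawl("/", 4, func(u string) []string { return site[u] })
	fmt.Println(len(edges), "pages crawled")
}
```

The coordinating loop both feeds the frontier and drains the response channel in a single select, which keeps workers from blocking on an unread response while the coordinator is blocked on a send.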

Limitations

  • Only follows HTML href links
  • Impolite: doesn't consider robots.txt
  • Doesn't detect crawler traps (infinitely generated loops, ...)
  • Runs on a single machine
  • Lacks many other features of serious crawlers...

Build & Run

Install the package in your Go path, then build and run it:
go build && ./go-crawler

Test

go test

Usage

Usage of ./crawler:  
  -timeout int  
        Timeout in seconds per http request (default 10)  
  -url string  
        url to crawl (default "http://www.monzo.com")  
  -workers int  
        Number of workers (default 100)
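
For illustration, the three options listed above could be declared with Go's standard flag package along these lines. The flag names, defaults and help strings are taken from the usage text; the config struct and the parseFlags helper are hypothetical and may not match how the package is actually organized.

```go
package main

import (
	"flag"
	"fmt"
	"os"
	"time"
)

// config is a hypothetical container for the parsed options.
type config struct {
	timeout time.Duration
	url     string
	workers int
}

// parseFlags parses the options from the usage text out of args.
func parseFlags(args []string) (config, error) {
	fs := flag.NewFlagSet("crawler", flag.ContinueOnError)
	timeout := fs.Int("timeout", 10, "Timeout in seconds per http request")
	url := fs.String("url", "http://www.monzo.com", "url to crawl")
	workers := fs.Int("workers", 100, "Number of workers")
	if err := fs.Parse(args); err != nil {
		return config{}, err
	}
	return config{
		timeout: time.Duration(*timeout) * time.Second,
		url:     *url,
		workers: *workers,
	}, nil
}

func main() {
	cfg, err := parseFlags(os.Args[1:])
	if err != nil {
		os.Exit(2)
	}
	fmt.Printf("crawling %s with %d workers, timeout %s\n",
		cfg.url, cfg.workers, cfg.timeout)
}
```

An invocation such as `./go-crawler -workers 50 -timeout 5` would then override two of the defaults and keep the default URL.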

Performance

Measured on a dual-core i5 over 5GHz Wifi:
Crawls 305 pages from www.monzo.com in ~3s

For the test case above, increasing the number of workers beyond 50 doesn't improve performance much.

Sitemap

To create a sitemap visualization, install
http://www.graphviz.org
and run
dot -Tsvg sitemap.dot -O

The resulting sitemap can be very big. The example provided in this repo (sitemap-reduced.svg) is reduced to a single link per website.
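
As a rough idea of what a DOT sitemap looks like, the sketch below renders a small page-to-links map as a directed graph. The function name dotString, the graph name sitemap and the sample data are assumptions made for this example; the actual layout of the sitemap.dot file written by the crawler may differ.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// dotString renders page -> links edges as a DOT digraph.
// Pages are sorted so the output is deterministic.
func dotString(edges map[string][]string) string {
	pages := make([]string, 0, len(edges))
	for p := range edges {
		pages = append(pages, p)
	}
	sort.Strings(pages)

	var b strings.Builder
	b.WriteString("digraph sitemap {\n")
	for _, from := range pages {
		for _, to := range edges[from] {
			// %q applies Go string quoting, which is close enough to
			// DOT quoting for ordinary URLs (an assumption, not a
			// full DOT escaper).
			fmt.Fprintf(&b, "\t%q -> %q;\n", from, to)
		}
	}
	b.WriteString("}\n")
	return b.String()
}

func main() {
	// Hypothetical sample data standing in for crawl results.
	edges := map[string][]string{
		"/":      {"/about"},
		"/about": {"/"},
	}
	fmt.Print(dotString(edges))
}
```

Redirecting this output to a file and passing that file to the dot command shown above should produce an SVG with one node per page and one arrow per link.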


Package last updated on 19 Mar 2018
