github.com/nerophon/crawler

v0.0.0-20170324102922-0ffa305efcfa
Crawler

An engineering exercise implemented in Go.

A simple web crawler that visits all pages within a given domain, but does not follow external links. It outputs a simple structured site map, showing for each page:

  • domain-internal page links
  • external page links
  • links to static content such as images

This entire project can be cloned directly from GitHub via: https://github.com/nerophon/crawler

Prerequisites

  • A working Go toolchain (the installation steps rely on go install and $GOPATH)

Installation

  • clone this project
  • cd to the project directory
  • run go install

The software will be installed to the $GOPATH/bin directory by default.

Testing & Benchmarking

This software includes unit tests, which run the standard way for Go tests:

  • cd to the source folder containing the test files you wish to run
  • run go test

Benchmarks exist for key steps in the process. These can be run from the root project directory, via the crawler_test.go file. I suggest running each benchmark separately, using the following commands:

go test -bench=BenchmarkFetch -benchtime=7s
go test -bench=BenchmarkCrawl -benchtime=15s

Please be aware that a benchmark of this kind, if run carelessly, could be mistaken for a denial-of-service (DoS) attack. The -benchtime flag may need adjusting depending on which website the test targets. I strongly advise NOT targeting websites that are frequent DoS targets, such as those belonging to major corporations.
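One way to sidestep the DoS concern entirely is to benchmark against a server running on the local machine. The sketch below is not the project's own benchmark: fetch and newLocalSite are hypothetical helpers, and the page content is invented. It uses testing.Benchmark so that it runs as an ordinary program, with the default benchmark duration rather than a -benchtime value.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"testing"
)

// fetch is a hypothetical stand-in for the crawler's fetch step: it
// downloads one page and returns the size of the body in bytes.
func fetch(client *http.Client, target string) (int, error) {
	resp, err := client.Get(target)
	if err != nil {
		return 0, err
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return 0, err
	}
	return len(body), nil
}

// newLocalSite starts a throwaway HTTP server on the loopback interface
// that serves one small, made-up page for every request.
func newLocalSite() *httptest.Server {
	return httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprint(w, `<html><body><a href="/about">About</a></body></html>`)
	}))
}

func main() {
	srv := newLocalSite()
	defer srv.Close()

	// testing.Benchmark runs the function outside of "go test".
	result := testing.Benchmark(func(b *testing.B) {
		for i := 0; i < b.N; i++ {
			if _, err := fetch(srv.Client(), srv.URL); err != nil {
				b.Fatal(err)
			}
		}
	})
	fmt.Println(result)
}
```

Numbers from a loopback server say nothing about real network latency, so this only measures the client-side cost of a fetch; it is a safe starting point, not a replacement for a carefully run benchmark against a site you control.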

Launching

  • cd to the install directory, usually $GOPATH/bin
  • run ./crawler

Operation

At the application command prompt, the following commands are available:

crawl [URL]		begin crawling the specified domain
help			show available commands
quit			exit the application

Press Ctrl-C during a crawl to halt it and force-quit back to the OS command line.
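A prompt with these three commands could be dispatched roughly as follows. This is a minimal sketch under assumptions, not the project's implementation: parseCommand, the "> " prompt string, and the printed messages are invented for illustration, and the crawl branch only echoes its argument.

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

// parseCommand splits an input line into a lower-cased command name and an
// optional first argument. Both are empty for a blank line.
func parseCommand(line string) (cmd, arg string) {
	fields := strings.Fields(line)
	if len(fields) == 0 {
		return "", ""
	}
	cmd = strings.ToLower(fields[0])
	if len(fields) > 1 {
		arg = fields[1]
	}
	return cmd, arg
}

func main() {
	scanner := bufio.NewScanner(os.Stdin)
	fmt.Print("> ")
	for scanner.Scan() {
		switch cmd, arg := parseCommand(scanner.Text()); cmd {
		case "crawl":
			if arg == "" {
				fmt.Println("usage: crawl [URL]")
			} else {
				fmt.Println("would crawl", arg) // placeholder for the real crawl
			}
		case "help":
			fmt.Println("commands: crawl [URL], help, quit")
		case "quit":
			return
		case "":
			// blank line: show the prompt again
		default:
			fmt.Println("unknown command:", cmd)
		}
		fmt.Print("> ")
	}
}
```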

Package last updated on 24 Mar 2017