github.com/tomstuart92/web-crawler

Concurrent Web Crawler

Task

We'd like you to write a simple web crawler in a programming language of your choice. Feel free to choose one you're very familiar with or, if you'd like to learn some Go, to make this your first Go program! The crawler should be limited to one domain: starting with https://monzo.com/, it would crawl all pages within monzo.com but not follow external links, for example to the Facebook and Twitter accounts. Given a URL, it should print a simple site map showing the links between pages.

Ideally, write it as you would a production piece of code. Bonus points for tests and making it as fast as possible!

Run Instructions

go run main.go --target=https://jigsaw.xyz --concurrency=1 --singleDomain=true
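
A minimal sketch of how those flags might be parsed with the standard library's flag package. The flag names mirror the command above, but the actual definitions and defaults in main.go may differ:

```go
package main

import (
	"flag"
	"fmt"
)

func main() {
	// Flag names mirror the run command above; the defaults are assumptions.
	target := flag.String("target", "", "root URL to start crawling from")
	concurrency := flag.Int("concurrency", 1, "number of concurrent fetcher goroutines")
	singleDomain := flag.Bool("singleDomain", true, "restrict the crawl to the target's domain")
	flag.Parse()

	fmt.Printf("crawling %s (concurrency=%d, singleDomain=%t)\n", *target, *concurrency, *singleDomain)
}
```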

High Level Design

Utilise channels and goroutines: a fixed pool of fetcher goroutines, sized by the concurrency flag, is responsible for fetching URLs. The fetched pages are sent on to a second set of goroutines, which may scale out as wide as needed to tokenise the HTML and extract the links. This design keeps the number of concurrent connections to the target site bounded while handling the tokenisation (which can be more expensive than the fetch itself) with maximum parallelism; a sketch of the pipeline follows.
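
A sketch of that two-stage pipeline. The names (page, startFetchers, extractLinks) are illustrative rather than the repository's actual API, and it assumes golang.org/x/net/html for tokenisation:

```go
package crawler

import (
	"io"
	"net/http"
	"sync"

	"golang.org/x/net/html"
)

// page pairs a fetched URL with its response body.
type page struct {
	url  string
	body io.ReadCloser
}

// startFetchers launches exactly `concurrency` goroutines, bounding the
// number of simultaneous connections to the target site.
func startFetchers(concurrency int, urls <-chan string, pages chan<- page) {
	var wg sync.WaitGroup
	for i := 0; i < concurrency; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for u := range urls {
				resp, err := http.Get(u)
				if err != nil {
					continue // production code would log or retry here
				}
				pages <- page{url: u, body: resp.Body}
			}
		}()
	}
	go func() { wg.Wait(); close(pages) }()
}

// extractLinks tokenises the HTML and collects the href of every <a> tag.
// Each page can be handled by its own goroutine, since tokenisation is
// CPU-bound and free to scale wider than the fetcher pool.
func extractLinks(body io.Reader) []string {
	var links []string
	z := html.NewTokenizer(body)
	for {
		switch z.Next() {
		case html.ErrorToken: // io.EOF or a parse error ends the stream
			return links
		case html.StartTagToken:
			if t := z.Token(); t.Data == "a" {
				for _, attr := range t.Attr {
					if attr.Key == "href" {
						links = append(links, attr.Val)
					}
				}
			}
		}
	}
}
```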

Once the links have been extracted, they are returned to the main goroutine, which tracks which links have already been seen and records the relationships between pages in a graph data structure. New links are dispatched back to the worker goroutines to be scraped, and the process continues until there are no pages left to scrape.
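
Continuing the hypothetical package above, the coordinator might look like this sketch. The result type and channel wiring are assumptions; in practice the urls channel needs enough buffering (or a separate queue) so the coordinator never blocks on a send while workers are waiting to deliver results:

```go
// result pairs a scraped page with the links found on it (hypothetical type).
type result struct {
	from  string
	links []string
}

// coordinate owns all crawl state: the seen set and the page graph live only
// in this goroutine, so no locking is needed. It stops once every dispatched
// URL has been reported back and no new links remain.
func coordinate(start string, urls chan<- string, results <-chan result) map[string][]string {
	graph := make(map[string][]string) // adjacency list: page -> outgoing links
	seen := map[string]bool{start: true}
	pending := 1 // URLs dispatched to workers but not yet reported back
	urls <- start
	for pending > 0 {
		r := <-results
		pending--
		graph[r.from] = r.links
		for _, link := range r.links {
			if !seen[link] {
				seen[link] = true
				pending++
				urls <- link
			}
		}
	}
	close(urls)
	return graph
}
```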

At that point a sitemap is printed via a BFS traversal of the graph.
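
That traversal could look like the following sketch, printing each page once, indented by its distance from the root (printSitemap is an illustrative name, operating on the adjacency list built above):

```go
package main

import (
	"fmt"
	"strings"
)

// printSitemap walks the graph breadth-first from the root, printing each
// page once, indented by its distance from the start URL.
func printSitemap(root string, graph map[string][]string) {
	type node struct {
		url   string
		depth int
	}
	visited := map[string]bool{root: true}
	queue := []node{{root, 0}}
	for len(queue) > 0 {
		n := queue[0]
		queue = queue[1:]
		fmt.Printf("%s%s\n", strings.Repeat("  ", n.depth), n.url)
		for _, link := range graph[n.url] {
			if !visited[link] {
				visited[link] = true
				queue = append(queue, node{url: link, depth: n.depth + 1})
			}
		}
	}
}
```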
