github.com/tomstuart92/web-crawler

Concurrent Web Crawler

Task

We'd like you to write a simple web crawler in a programming language of your choice. Feel free to choose one you're very familiar with or, if you'd like to learn some Go, to make this your first Go program! The crawler should be limited to one domain: starting with https://monzo.com/, it would crawl all pages within monzo.com but not follow external links, for example to the Facebook and Twitter accounts. Given a URL, it should print a simple site map showing the links between pages.

Ideally, write it as you would a production piece of code. Bonus points for tests and making it as fast as possible!

Run Instructions

go run main.go --target=https://jigsaw.xyz --concurrency=1 --singleDomain=true
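
A minimal sketch of how those flags might be parsed with the standard library's flag package. The flag names mirror the command above, but the actual definitions and defaults in main.go may differ:

```go
package main

import (
	"flag"
	"fmt"
)

func main() {
	// Flag names mirror the run command above; the defaults are assumptions.
	target := flag.String("target", "", "root URL to start crawling from")
	concurrency := flag.Int("concurrency", 1, "number of concurrent fetcher goroutines")
	singleDomain := flag.Bool("singleDomain", true, "restrict the crawl to the target's domain")
	flag.Parse()

	fmt.Printf("crawling %s (concurrency=%d, singleDomain=%t)\n", *target, *concurrency, *singleDomain)
}
```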

High Level Design

Utilise channels and goroutines: a fixed pool of fetcher goroutines, sized by the concurrency flag, is responsible for fetching URLs. The fetched pages are sent on to a second set of goroutines, which may scale out as wide as needed to tokenise the HTML and extract the links. This design keeps the number of concurrent connections to the target site bounded while handling the tokenisation (which can be more expensive than the fetch itself) with maximum parallelism; a sketch of the pipeline follows.
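
A sketch of that two-stage pipeline. The names (page, startFetchers, extractLinks) are illustrative rather than the repository's actual API, and it assumes golang.org/x/net/html for tokenisation:

```go
package crawler

import (
	"io"
	"net/http"
	"sync"

	"golang.org/x/net/html"
)

// page pairs a fetched URL with its response body.
type page struct {
	url  string
	body io.ReadCloser
}

// startFetchers launches exactly `concurrency` goroutines, bounding the
// number of simultaneous connections to the target site.
func startFetchers(concurrency int, urls <-chan string, pages chan<- page) {
	var wg sync.WaitGroup
	for i := 0; i < concurrency; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for u := range urls {
				resp, err := http.Get(u)
				if err != nil {
					continue // production code would log or retry here
				}
				pages <- page{url: u, body: resp.Body}
			}
		}()
	}
	go func() { wg.Wait(); close(pages) }()
}

// extractLinks tokenises the HTML and collects the href of every <a> tag.
// Each page can be handled by its own goroutine, since tokenisation is
// CPU-bound and free to scale wider than the fetcher pool.
func extractLinks(body io.Reader) []string {
	var links []string
	z := html.NewTokenizer(body)
	for {
		switch z.Next() {
		case html.ErrorToken: // io.EOF or a parse error ends the stream
			return links
		case html.StartTagToken:
			if t := z.Token(); t.Data == "a" {
				for _, attr := range t.Attr {
					if attr.Key == "href" {
						links = append(links, attr.Val)
					}
				}
			}
		}
	}
}
```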

Once the links have been extracted, they are returned to the main goroutine, which tracks which links have already been seen and records the relationships between pages in a graph data structure. New links are dispatched back to the worker goroutines to be scraped, and the process continues until there are no pages left to scrape.
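
Continuing the hypothetical package above, the coordinator might look like this sketch. The result type and channel wiring are assumptions; in practice the urls channel needs enough buffering (or a separate queue) so the coordinator never blocks on a send while workers are waiting to deliver results:

```go
// result pairs a scraped page with the links found on it (hypothetical type).
type result struct {
	from  string
	links []string
}

// coordinate owns all crawl state: the seen set and the page graph live only
// in this goroutine, so no locking is needed. It stops once every dispatched
// URL has been reported back and no new links remain.
func coordinate(start string, urls chan<- string, results <-chan result) map[string][]string {
	graph := make(map[string][]string) // adjacency list: page -> outgoing links
	seen := map[string]bool{start: true}
	pending := 1 // URLs dispatched to workers but not yet reported back
	urls <- start
	for pending > 0 {
		r := <-results
		pending--
		graph[r.from] = r.links
		for _, link := range r.links {
			if !seen[link] {
				seen[link] = true
				pending++
				urls <- link
			}
		}
	}
	close(urls)
	return graph
}
```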

At that point a sitemap is printed via a BFS traversal of the graph.
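
That traversal could look like the following sketch, printing each page once, indented by its distance from the root (printSitemap is an illustrative name, operating on the adjacency list built above):

```go
package main

import (
	"fmt"
	"strings"
)

// printSitemap walks the graph breadth-first from the root, printing each
// page once, indented by its distance from the start URL.
func printSitemap(root string, graph map[string][]string) {
	type node struct {
		url   string
		depth int
	}
	visited := map[string]bool{root: true}
	queue := []node{{root, 0}}
	for len(queue) > 0 {
		n := queue[0]
		queue = queue[1:]
		fmt.Printf("%s%s\n", strings.Repeat("  ", n.depth), n.url)
		for _, link := range graph[n.url] {
			if !visited[link] {
				visited[link] = true
				queue = append(queue, node{url: link, depth: n.depth + 1})
			}
		}
	}
}
```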
