🕸️ Web Crawler (Golang)
A concurrent web crawler written in Go that recursively crawls a site's internal links, with configurable concurrency and page limits.
🚀 Features
- 🕵️ Crawls only internal pages (same domain)
- 🌐 Resolves both relative and absolute URLs (see the sketch after this list)
- 📈 Crawl summary report (pages sorted by frequency)
- ✅ Well-structured tests for core modules
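Relative hrefs have to be joined onto the page's base URL before they can be crawled. Below is a minimal sketch of how this can be done with Go's standard net/url package; resolveURL is an illustrative name, not the crawler's actual API:

```go
package main

import (
	"fmt"
	"net/url"
)

// resolveURL joins a (possibly relative) href onto the page's base URL.
// Hypothetical helper for illustration only.
func resolveURL(base, href string) (string, error) {
	baseURL, err := url.Parse(base)
	if err != nil {
		return "", err
	}
	hrefURL, err := url.Parse(href)
	if err != nil {
		return "", err
	}
	// ResolveReference handles both cases: an absolute href is returned
	// as-is, a relative href is resolved against the base.
	return baseURL.ResolveReference(hrefURL).String(), nil
}

func main() {
	resolved, _ := resolveURL("https://example.com/blog/", "../about")
	fmt.Println(resolved) // https://example.com/about
}
```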
📦 Project Structure
```
.
├── main.go          # CLI entry point
├── configure.go     # Configuration and setup
├── crawlPage.go     # Crawling logic
├── internal
│   ├── customHTML   # HTML parsing and link extraction
│   │   ├── getHTML.go
│   │   ├── getURLsFromHTML.go
│   │   └── getURLsFromHTML_test.go
│   ├── customPrint  # Report formatting
│   │   ├── printReport.go
│   │   └── printReport_test.go
│   └── customURL    # URL normalization
│       ├── normalize_url.go
│       └── normalize_url_test.go
├── go.mod
└── go.sum
```
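As a rough guide to what the customURL package is for: normalization maps equivalent URLs (different scheme, trailing slash) to a single canonical key so each page is counted only once. A hedged sketch, assuming the rules shown below; the actual function in normalize_url.go may differ:

```go
package customURL

import (
	"net/url"
	"strings"
)

// NormalizeURL reduces a URL to a host+path key so that, for example,
// https://example.com/path and http://example.com/path/ compare equal.
// Sketch only; see normalize_url.go for the real rules.
func NormalizeURL(rawURL string) (string, error) {
	parsed, err := url.Parse(rawURL)
	if err != nil {
		return "", err
	}
	// Drop the scheme, lowercase the host, and trim any trailing slash.
	return strings.ToLower(parsed.Host) + strings.TrimSuffix(parsed.Path, "/"), nil
}
```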
🛠️ Installation
```
git clone git@github.com:akashgupta1909/web-crawler.git
cd web-crawler
go mod tidy
```
🧪 Testing
Run the tests using the following command:
```
go test ./...
```
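The tests follow Go's usual table-driven style. A hypothetical example against the NormalizeURL sketch above (the project's real cases live in normalize_url_test.go):

```go
package customURL

import "testing"

func TestNormalizeURL(t *testing.T) {
	tests := []struct {
		name     string
		input    string
		expected string
	}{
		{"strips scheme", "https://example.com/path", "example.com/path"},
		{"strips trailing slash", "https://example.com/path/", "example.com/path"},
	}

	for _, tc := range tests {
		t.Run(tc.name, func(t *testing.T) {
			actual, err := NormalizeURL(tc.input)
			if err != nil {
				t.Fatalf("unexpected error: %v", err)
			}
			if actual != tc.expected {
				t.Errorf("expected %q, got %q", tc.expected, actual)
			}
		})
	}
}
```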
🏗️ Build
To build the project, use the following command:
```
go build
```
🏃‍♂️ Usage
```
go run main.go <base-url> [maxConcurrency] [maxPages]
```
Example
```
go run main.go https://example.com 10 100
```
This command crawls https://example.com with at most 10 concurrent requests, stopping after 100 unique pages.
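A common way to cap in-flight requests in Go is a buffered channel used as a semaphore; the sketch below shows the pattern under that assumption, though crawlPage.go may implement the limit differently:

```go
package main

import (
	"fmt"
	"sync"
)

func main() {
	maxConcurrency := 10
	urls := []string{"https://example.com/a", "https://example.com/b"}

	// A buffered channel as a semaphore: sends block once
	// maxConcurrency goroutines hold a slot.
	sem := make(chan struct{}, maxConcurrency)
	var wg sync.WaitGroup

	for _, u := range urls {
		wg.Add(1)
		sem <- struct{}{} // acquire a slot (blocks at the limit)
		go func(u string) {
			defer wg.Done()
			defer func() { <-sem }()   // release the slot when done
			fmt.Println("fetching", u) // stand-in for the real page fetch
		}(u)
	}
	wg.Wait()
}
```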
Sample Output
```
starting crawl of: https://example.blog.dev
concurrency 10
max pages 20
base url https://example.blog.dev
starting crawling on: example.blog.dev/path
...
==========================
Report for: https://example.blog.dev
==========================
Found 5 internal links to example.blog.dev
Found 3 internal links to example.blog.dev/about
Found 2 internal links to example.blog.dev/path
...
```
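The report lists pages by descending inbound-link count. One way to produce that ordering, assuming the crawl results are kept in a map from normalized URL to count (printReport.go may do this differently):

```go
package main

import (
	"fmt"
	"sort"
)

func main() {
	// Hypothetical crawl result: normalized URL -> inbound internal links.
	pages := map[string]int{
		"example.blog.dev":       5,
		"example.blog.dev/about": 3,
		"example.blog.dev/path":  2,
	}

	urls := make([]string, 0, len(pages))
	for u := range pages {
		urls = append(urls, u)
	}
	// Sort by count descending, breaking ties alphabetically so the
	// report is deterministic.
	sort.Slice(urls, func(i, j int) bool {
		if pages[urls[i]] != pages[urls[j]] {
			return pages[urls[i]] > pages[urls[j]]
		}
		return urls[i] < urls[j]
	})

	for _, u := range urls {
		fmt.Printf("Found %d internal links to %s\n", pages[u], u)
	}
}
```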