github.com/rhamdeew/sitemap-checker

v0.3.0
Sitemap Checker

A high-performance Go tool to validate all URLs in a website's sitemap.xml, with comprehensive error detection and redirect validation.

Features

  • Complete sitemap validation: Process both sitemap indexes and individual sitemaps
  • Recursive processing: Handles nested sitemaps within sitemap indexes
  • Redirect detection: Identifies and logs all redirects, capturing the redirect destination
  • Parallel processing: Efficiently checks multiple URLs concurrently with configurable parallelism
  • Rate limiting: Configurable delays between requests to avoid overwhelming servers
  • Detailed logging: Comprehensive logs with timestamps, status codes, and errors
  • Progress visualization: Real-time progress bar to monitor validation status
  • HEAD request optimization: Uses HEAD requests by default, with fallback to GET for URLs that don't support HEAD

Installation

Prerequisites

  • Go 1.18 or higher

Installation Steps

# Clone the repository
git clone https://github.com/rhamdeew/sitemap-checker.git
cd sitemap-checker

# Build the binary
go build -o sitemap_checker

# Optional: install the binary into Go's bin directory ($GOBIN, or $GOPATH/bin by default)
go install

Usage

# Basic usage
./sitemap_checker -u https://example.com/sitemap.xml

# With a custom delay between requests, in milliseconds (default: 1000)
./sitemap_checker -u https://example.com/sitemap.xml -t 500

# Specify a custom directory for log files
./sitemap_checker -u https://example.com/sitemap.xml -logdir ./logs

# Run with 10 parallel requests
./sitemap_checker -u https://example.com/sitemap.xml -c 10

# Skip SSL certificate validation
./sitemap_checker -u https://example.com/sitemap.xml -k

# Combine options
./sitemap_checker -u https://example.com/sitemap.xml -t 200 -c 5 -logdir ./logs -k

Command Line Options

Flag      Description                                       Default
-u        URL of the sitemap.xml file (required)            None (required)
-t        Timeout in milliseconds between check requests    1000 (1 second)
-logdir   Directory to store log files                      Current directory
-c        Number of parallel requests to execute            1 (sequential)
-k        Skip SSL certificate validation                   false

Log Files

Log files are automatically created with a naming format of:

hostname-YYYY-MM-DD-HH-MM-SS.log

Example: example-com-2025-03-14-14-30-45.log

Log File Contents

Each log file contains:

  • Header with sitemap URL and start time
  • Concurrency configuration (number of parallel requests)
  • Full list of problematic URLs with details:
    • Invalid status codes (non-2xx)
    • Connection errors
    • Redirects with their destination URLs
  • Summary statistics
  • End timestamp

How It Works

  • Sitemap Retrieval: The tool fetches and parses the provided sitemap URL
  • Recursive Processing: For sitemap indexes, it processes all child sitemaps
  • URL Extraction: All URLs are extracted from the sitemap(s)
  • Parallel Validation Process:
    • Makes a HEAD request for each URL (more efficient)
    • Falls back to GET if HEAD is not supported (status 405)
    • Records status codes, errors, and redirect locations
    • Does not follow redirects - instead flags them as issues
    • Controls concurrency using a semaphore pattern
  • Reporting: Provides a detailed summary of problematic URLs

Redirect Handling

This tool specifically detects and flags redirects (status codes 3xx) without following them. For each redirect, it:

  • Identifies the redirect status code (301, 302, 303, 307, 308)
  • Captures the destination URL from the Location header
  • Marks the URL as problematic in reports
  • Logs the redirect's status code and destination (only the first hop is recorded, because redirects are not followed)

Example Output

Retrieving URLs from sitemap...
Found sitemap index with 3 sitemaps
Processing referenced sitemap: https://example.com/post-sitemap.xml
Processing referenced sitemap: https://example.com/page-sitemap.xml
Processing referenced sitemap: https://example.com/category-sitemap.xml
Found 845 URLs to check
Checking URLs...
[==================================================>] 845/845 (100%)

REDIRECT: https://example.com/old-page/ -> https://example.com/new-page/ (Status: 301)
INVALID STATUS: https://example.com/missing-page/ - 404
ERROR: https://example.com/timeout-page/ - Get "https://example.com/timeout-page/": context deadline exceeded (Client.Timeout exceeded while awaiting headers)

Summary: Found 37 problematic URLs out of 845 total URLs
Redirects: 12 URLs

Performance Tuning

  • The default timeout between requests is 1000ms (1 second)
  • The default concurrency is 1 (sequential requests)
  • For optimal performance:
    • Increase concurrency (-c flag) to check multiple URLs in parallel
    • Adjust the timeout (-t flag) based on the server's capacity
  • Recommended starting values:
    • Small sites: -c 5 -t 500
    • Medium sites: -c 10 -t 1000
    • Large sites: -c 20 -t 2000

License

MIT License

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

Package last updated on 12 May 2025
