crawl-cli-tool

A CLI tool for web crawling with auto-discovery, recursive crawling, and markdown output

npm · latest · version 1.0.1 · 1 maintainer

crawl-cli

A CLI tool for web crawling with auto-discovery, recursive crawling, and markdown output. Optimized to be used as a Claude Skill.

Features

  • Auto-Discovery: Automatically finds and uses llms.txt, sitemap.xml, or robots.txt
  • Recursive Crawling: Follow internal links with configurable depth
  • Batch Processing: Parallel crawling with concurrency control
  • Multiple Output Formats: Markdown, JSON, HTML, or plain text
  • Flexible Output: Single file, directory (each URL as separate file), or stdout
  • GitHub Support: Transforms GitHub blob URLs to raw content

Installation

npm install -g crawl-cli-tool

During installation, Playwright's Chromium browser is downloaded automatically.

Quick Reference

Task                         Command
Crawl single page            crawl-cli <url>
Recursive crawl (depth 2)    crawl-cli <url> -d 2
Save to single file          crawl-cli <url> -o output.md
Each URL as separate file    crawl-cli <url> -d 2 -O ./docs/
JSON format                  crawl-cli <url> -f json
Discover files               crawl-cli discover <url>

Usage

Basic Crawl

# Crawl a single page
crawl-cli https://example.com

# Crawl with auto-discovery (default for single page)
crawl-cli https://docs.anthropic.com

Recursive Crawling

# Crawl with depth 3
crawl-cli https://example.com -d 3

# Limit to 50 pages
crawl-cli https://example.com -d 3 -m 50

Output Formats

# Markdown (default)
crawl-cli https://example.com -f md

# JSON output
crawl-cli https://example.com -f json

# HTML output (styled, ready for browser)
crawl-cli https://example.com -f html -o page.html

# Plain text (stripped formatting)
crawl-cli https://example.com -f txt

Output Destinations

Single File (-o)

All crawled pages combined into one file:

# Single page to file
crawl-cli https://example.com/page -o ./articles/page.md

# Multiple pages (depth 2) to one file
crawl-cli https://docs.example.com -d 2 -o ./all-docs.md

# Append to existing file
crawl-cli https://another.com -o ./all-docs.md --append

Directory - Each URL as Separate File (-O)

Each crawled URL becomes its own file:

# Crawl docs site, each page as separate file
crawl-cli https://docs.anthropic.com -d 2 -O ./anthropic-docs/

# Result:
# anthropic-docs/
# ├── _index.md           # Auto-generated index with links
# ├── docs_anthropic_com.md
# ├── getting-started.md
# ├── api_reference.md
# └── guides_setup.md

# Same in JSON format
crawl-cli https://docs.example.com -d 2 -O ./docs/ -f json

# Skip index file generation
crawl-cli https://docs.example.com -d 2 -O ./docs/ --no-index

# See all files created
crawl-cli https://docs.example.com -d 2 -O ./docs/ -v

stdout (Default)

# Print to console
crawl-cli https://example.com

# Quiet mode for piping (no spinner/progress)
crawl-cli https://example.com -q -f json | jq '.[] | .title'

Discovery

# Just discover available files (llms.txt, sitemap, etc.)
crawl-cli discover https://example.com

# Force auto-discovery
crawl-cli https://example.com --discover

# Disable auto-discovery
crawl-cli https://example.com --no-discover

Advanced Options

# Increase concurrent requests
crawl-cli https://example.com -c 10

# Set page timeout (ms)
crawl-cli https://example.com -t 60000

# Verbose output
crawl-cli https://example.com -v

Options

Option                       Description                               Default
-d, --depth <n>              Maximum crawl depth                       1
-m, --max-pages <n>          Maximum pages to crawl                    100
-c, --concurrent <n>         Concurrent requests                       5
-o, --output <file>          Output to single file (combined)          stdout
-O, --output-dir <folder>    Output to directory (each URL separate)   -
-f, --format <type>          Output format: md, json, html, txt        md
--json                       Shorthand for -f json                     -
--append                     Append to file instead of overwrite       false
--no-index                   Skip index file for directory output      false
--discover                   Force auto-discovery                      auto
--no-discover                Disable auto-discovery                    -
-t, --timeout <ms>           Page timeout                              30000
-v, --verbose                Verbose output                            false
-q, --quiet                  Minimal output (good for piping)          false

Auto-Discovery Priority

When crawling, the tool checks for these files in order:

  • llms.txt - AI assistant instructions file
  • llms-full.txt - Full llms.txt variant
  • sitemap.xml - Site structure map
  • robots.txt - Crawling rules
  • .well-known/ai.txt - Well-known AI file
  • .well-known/llms.txt - Well-known llms variant
  • .well-known/sitemap.xml - Well-known sitemap
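In rough terms, this priority list amounts to probing a fixed set of paths on the site's origin and taking the first one that responds. The sketch below is illustrative only, not the package's actual internals; the `discoveryCandidates` and `findFirstDiscoveryFile` names are made up for this example.

```javascript
// Discovery paths in the priority order documented above.
const DISCOVERY_PATHS = [
  'llms.txt',
  'llms-full.txt',
  'sitemap.xml',
  'robots.txt',
  '.well-known/ai.txt',
  '.well-known/llms.txt',
  '.well-known/sitemap.xml',
];

// Build the full candidate URLs for a site, highest priority first.
function discoveryCandidates(baseUrl) {
  const origin = new URL(baseUrl).origin;
  return DISCOVERY_PATHS.map((path) => `${origin}/${path}`);
}

// Probe candidates in order; return the first URL that answers 200, or null.
async function findFirstDiscoveryFile(baseUrl) {
  for (const url of discoveryCandidates(baseUrl)) {
    const res = await fetch(url, { method: 'HEAD' }).catch(() => null);
    if (res && res.ok) return url;
  }
  return null;
}
```

Note that candidates are derived from the origin, so crawling a deep URL like https://example.com/docs/page still probes https://example.com/llms.txt first.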

Programmatic Usage

import { crawl, discover, crawlSingle } from 'crawl-cli-tool';

// Full crawl with options
const results = await crawl('https://example.com', {
  maxDepth: 2,
  maxPages: 50,
  autoDiscover: true,
});

// Single page
const result = await crawlSingle('https://example.com/page');

// Just discovery
const discovered = await discover('https://example.com');
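The shape of the result objects is not documented above. Assuming each result carries at least a url, title, and content field (an assumption; check the package's actual types), merging programmatic results into one markdown document, similar to what the -o flag does, might look like:

```javascript
// Combine crawl results into a single markdown string.
// The { url, title, content } shape is an assumption for illustration.
function combineToMarkdown(results) {
  return results
    .map((r) => `# ${r.title}\n\nSource: ${r.url}\n\n${r.content}`)
    .join('\n\n---\n\n');
}

// Example with stand-in data:
const sample = [
  { url: 'https://example.com/a', title: 'Page A', content: 'Hello' },
  { url: 'https://example.com/b', title: 'Page B', content: 'World' },
];
const combined = combineToMarkdown(sample);
```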

Troubleshooting

"Chromium not found"

npx playwright install chromium

"Timeout waiting for page"

crawl-cli <url> -t 60000  # 60 second timeout

"Too many pages"

crawl-cli <url> -m 20 -d 1  # Limit to 20 pages, depth 1

License

MIT

Keywords

crawler

Package last updated on 01 Dec 2025
