crawl-cli-tool

A CLI tool for web crawling with auto-discovery, recursive crawling, and markdown output

npm · latest · version 1.0.1 · 1 maintainer

crawl-cli

A CLI tool for web crawling with auto-discovery, recursive crawling, and markdown output. Optimized to be used as a Claude Skill.

Features

  • Auto-Discovery: Automatically finds and uses llms.txt, sitemap.xml, or robots.txt
  • Recursive Crawling: Follow internal links with configurable depth
  • Batch Processing: Parallel crawling with concurrency control
  • Multiple Output Formats: Markdown, JSON, HTML, or plain text
  • Flexible Output: Single file, directory (each URL as separate file), or stdout
  • GitHub Support: Transforms GitHub blob URLs to raw content

Installation

npm install -g crawl-cli-tool

During installation, Playwright's Chromium browser is downloaded automatically.

Quick Reference

Task                         Command
Crawl single page            crawl-cli <url>
Recursive crawl (depth 2)    crawl-cli <url> -d 2
Save to single file          crawl-cli <url> -o output.md
Each URL as separate file    crawl-cli <url> -d 2 -O ./docs/
JSON format                  crawl-cli <url> -f json
Discover files               crawl-cli discover <url>

Usage

Basic Crawl

# Crawl a single page
crawl-cli https://example.com

# Crawl with auto-discovery (default for single page)
crawl-cli https://docs.anthropic.com

Recursive Crawling

# Crawl with depth 3
crawl-cli https://example.com -d 3

# Limit to 50 pages
crawl-cli https://example.com -d 3 -m 50

Output Formats

# Markdown (default)
crawl-cli https://example.com -f md

# JSON output
crawl-cli https://example.com -f json

# HTML output (styled, ready for browser)
crawl-cli https://example.com -f html -o page.html

# Plain text (stripped formatting)
crawl-cli https://example.com -f txt

Output Destinations

Single File (-o)

All crawled pages combined into one file:

# Single page to file
crawl-cli https://example.com/page -o ./articles/page.md

# Multiple pages (depth 2) to one file
crawl-cli https://docs.example.com -d 2 -o ./all-docs.md

# Append to existing file
crawl-cli https://another.com -o ./all-docs.md --append

Directory - Each URL as Separate File (-O)

Each crawled URL becomes its own file:

# Crawl docs site, each page as separate file
crawl-cli https://docs.anthropic.com -d 2 -O ./anthropic-docs/

# Result:
# anthropic-docs/
# ├── _index.md           # Auto-generated index with links
# ├── docs_anthropic_com.md
# ├── getting-started.md
# ├── api_reference.md
# └── guides_setup.md

# Same in JSON format
crawl-cli https://docs.example.com -d 2 -O ./docs/ -f json

# Skip index file generation
crawl-cli https://docs.example.com -d 2 -O ./docs/ --no-index

# See all files created
crawl-cli https://docs.example.com -d 2 -O ./docs/ -v

stdout (Default)

# Print to console
crawl-cli https://example.com

# Quiet mode for piping (no spinner/progress)
crawl-cli https://example.com -q -f json | jq '.[] | .title'

Discovery

# Just discover available files (llms.txt, sitemap, etc.)
crawl-cli discover https://example.com

# Force auto-discovery
crawl-cli https://example.com --discover

# Disable auto-discovery
crawl-cli https://example.com --no-discover

Advanced Options

# Increase concurrent requests
crawl-cli https://example.com -c 10

# Set page timeout (ms)
crawl-cli https://example.com -t 60000

# Verbose output
crawl-cli https://example.com -v

Options

Option                       Description                               Default
-d, --depth <n>              Maximum crawl depth                       1
-m, --max-pages <n>          Maximum pages to crawl                    100
-c, --concurrent <n>         Concurrent requests                       5
-o, --output <file>          Output to single file (combined)          stdout
-O, --output-dir <folder>    Output to directory (each URL separate)   -
-f, --format <type>          Output format: md, json, html, txt        md
--json                       Shorthand for -f json                     -
--append                     Append to file instead of overwrite       false
--no-index                   Skip index file for directory output      false
--discover                   Force auto-discovery                      auto
--no-discover                Disable auto-discovery                    -
-t, --timeout <ms>           Page timeout                              30000
-v, --verbose                Verbose output                            false
-q, --quiet                  Minimal output (good for piping)          false

Auto-Discovery Priority

When crawling, the tool checks for these files in order:

  • llms.txt - AI assistant instructions file
  • llms-full.txt - Full llms.txt variant
  • sitemap.xml - Site structure map
  • robots.txt - Crawling rules
  • .well-known/ai.txt - Well-known AI file
  • .well-known/llms.txt - Well-known llms variant
  • .well-known/sitemap.xml - Well-known sitemap
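In rough terms, this priority list amounts to probing a fixed set of paths on the site's origin and taking the first one that responds. The sketch below is illustrative only, not the package's actual internals; the `discoveryCandidates` and `findFirstDiscoveryFile` names are made up for this example.

```javascript
// Discovery paths in the priority order documented above.
const DISCOVERY_PATHS = [
  'llms.txt',
  'llms-full.txt',
  'sitemap.xml',
  'robots.txt',
  '.well-known/ai.txt',
  '.well-known/llms.txt',
  '.well-known/sitemap.xml',
];

// Build the full candidate URLs for a site, highest priority first.
function discoveryCandidates(baseUrl) {
  const origin = new URL(baseUrl).origin;
  return DISCOVERY_PATHS.map((path) => `${origin}/${path}`);
}

// Probe candidates in order; return the first URL that answers 200, or null.
async function findFirstDiscoveryFile(baseUrl) {
  for (const url of discoveryCandidates(baseUrl)) {
    const res = await fetch(url, { method: 'HEAD' }).catch(() => null);
    if (res && res.ok) return url;
  }
  return null;
}
```

Note that candidates are derived from the origin, so crawling a deep URL like https://example.com/docs/page still probes https://example.com/llms.txt first.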

Programmatic Usage

import { crawl, discover, crawlSingle } from 'crawl-cli-tool';

// Full crawl with options
const results = await crawl('https://example.com', {
  maxDepth: 2,
  maxPages: 50,
  autoDiscover: true,
});

// Single page
const result = await crawlSingle('https://example.com/page');

// Just discovery
const discovered = await discover('https://example.com');
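The shape of the result objects is not documented above. Assuming each result carries at least a url, title, and content field (an assumption; check the package's actual types), merging programmatic results into one markdown document, similar to what the -o flag does, might look like:

```javascript
// Combine crawl results into a single markdown string.
// The { url, title, content } shape is an assumption for illustration.
function combineToMarkdown(results) {
  return results
    .map((r) => `# ${r.title}\n\nSource: ${r.url}\n\n${r.content}`)
    .join('\n\n---\n\n');
}

// Example with stand-in data:
const sample = [
  { url: 'https://example.com/a', title: 'Page A', content: 'Hello' },
  { url: 'https://example.com/b', title: 'Page B', content: 'World' },
];
const combined = combineToMarkdown(sample);
```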

Troubleshooting

"Chromium not found"

npx playwright install chromium

"Timeout waiting for page"

crawl-cli <url> -t 60000  # 60 second timeout

"Too many pages"

crawl-cli <url> -m 20 -d 1  # Limit to 20 pages, depth 1

License

MIT

Keywords

crawler

Package last updated on 01 Dec 2025
