
@jldb/web-to-md
A CLI tool to crawl (for example) documentation websites and convert them to Markdown.
Welcome to Web-to-MD, the CLI tool that turns websites into your personal Markdown library! 🚀
Ever wished you could magically transform entire websites into neatly organized Markdown files? Well, wish no more! Web-to-MD is here to save the day (and your sanity)!
npm install (sit back and watch the magic happen)
npm run build to compile the TypeScript code
Fire up Web-to-MD with this incantation:
npm start -- -u <url> -o <output_directory> [options]
-u, --url <url>: The URL of your web treasure trove (required)
-o, --output <output>: Where to stash your Markdown gold (required)
-c, --combine: Merge all pages into one massive scroll of knowledge
-e, --exclude <paths>: Comma-separated list of paths to skip (shh, we won't tell)
-r, --rate <rate>: Max pages per second (default: 5, for the speed demons)
-d, --depth <depth>: How deep should we dig? (default: 3, watch out for dragons)
-m, --max-file-size <size>: Max file size in MB for combined output (default: 2)
-n, --name <name>: Name your combined file (get creative!)
-p, --preserve-structure: Keep the directory structure (for the neat freaks)
-t, --timeout <timeout>: Timeout in seconds for page navigation (default: 3.5)
-i, --initial-timeout <initialTimeout>: Initial timeout in seconds for the first page (default: 60)
-re, --retries <retries>: Number of retries for initial page load (default: 3)
-w, --workers <workers>: Number of concurrent workers (default: 1, for the multitaskers)
npm start -- -u https://docs.example.com -o ./my_docs -c -d 5 -r 3 -n "ExampleDocs" -w 3
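This will crawl https://docs.example.com to a maximum depth of 5, at up to 3 pages per second with 3 concurrent workers, and combine the pages into a single output named "ExampleDocs" in ./my_docs.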
Web-to-MD comes with a nifty config feature that lets you resume interrupted crawls and customize your crawling experience. Here's how it works:
After a crawl (complete or interrupted), Web-to-MD saves a config.json file in your output directory. This file contains all the settings and state information from your last crawl.
To resume an interrupted crawl, simply run Web-to-MD with the same output directory. The tool will automatically detect the config.json file and pick up where it left off.
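The resume check described above can be pictured roughly as follows. This is a minimal sketch, not Web-to-MD's actual code: the `loadSavedConfig` helper and the `SavedConfig` type are hypothetical names, and only the top-level "settings" key is known from the examples in this README.

```typescript
import { existsSync, readFileSync } from "node:fs";
import { join } from "node:path";

// Hypothetical shape: only the "settings" key appears in this README's examples.
// Any saved crawl state lives in fields we do not know, so they stay untyped.
interface SavedConfig {
  settings: Record<string, unknown>;
  [key: string]: unknown;
}

// Returns the parsed config.json from outputDir, or undefined if none exists.
function loadSavedConfig(outputDir: string): SavedConfig | undefined {
  const configPath = join(outputDir, "config.json");
  if (!existsSync(configPath)) {
    return undefined; // nothing saved yet: treat this as a fresh crawl
  }
  return JSON.parse(readFileSync(configPath, "utf8")) as SavedConfig;
}

const saved = loadSavedConfig("./my_docs");
console.log(saved ? "Resuming with saved settings" : "Starting a fresh crawl");
```

The point of the sketch is only that the output directory doubles as the lookup key for a previous run, which is why resuming needs nothing more than the same -o value.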
You can manually edit the config.json file to customize your next crawl. Here are the available options and their default values:
Option | Description | Default Value |
---|---|---|
url | Starting URL for the crawl | (Required) |
outputDir | Output directory for Markdown files | (Required) |
excludePaths | Paths to exclude from crawling | [] |
maxPagesPerSecond | Maximum pages to crawl per second | 5 |
maxDepth | Maximum depth to crawl | 3 |
maxFileSizeMB | Maximum file size in MB for combined output | 2 |
combine | Combine all pages into a single file | false |
name | Name for the combined output file | undefined |
preserveStructure | Preserve directory structure | false |
timeout | Timeout in seconds for page navigation | 3.5 |
initialTimeout | Initial timeout in seconds for the first page load | 60 |
retries | Number of retries for initial page load | 3 |
numWorkers | Number of concurrent workers | 1 |
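For readers who think in types, the table above can be mirrored in TypeScript. This is a sketch under assumptions: `CrawlSettings`, `DEFAULTS`, and `withDefaults` are illustrative names that do not come from the tool's source, and the field types are inferred from the default values listed in the table.

```typescript
// Hypothetical type mirroring the option table; not taken from Web-to-MD's source.
interface CrawlSettings {
  url: string;
  outputDir: string;
  excludePaths: string[];
  maxPagesPerSecond: number;
  maxDepth: number;
  maxFileSizeMB: number;
  combine: boolean;
  name?: string; // undefined by default
  preserveStructure: boolean;
  timeout: number; // seconds
  initialTimeout: number; // seconds
  retries: number;
  numWorkers: number;
}

type RequiredKeys = "url" | "outputDir";

// Default values copied from the table; the two required options have none.
const DEFAULTS: Omit<CrawlSettings, RequiredKeys> = {
  excludePaths: [],
  maxPagesPerSecond: 5,
  maxDepth: 3,
  maxFileSizeMB: 2,
  combine: false,
  preserveStructure: false,
  timeout: 3.5,
  initialTimeout: 60,
  retries: 3,
  numWorkers: 1,
};

// Fills in every option the caller left out with its documented default.
function withDefaults(
  partial: Pick<CrawlSettings, RequiredKeys> & Partial<CrawlSettings>,
): CrawlSettings {
  return { ...DEFAULTS, ...partial };
}

const settings = withDefaults({
  url: "https://docs.example.com",
  outputDir: "./my_docs",
  maxDepth: 4,
});
console.log(settings.maxDepth, settings.numWorkers); // prints: 4 1
```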
You can modify these settings in the config.json file to customize your crawl. For example:
{
  "settings": {
    "url": "https://docs.example.com",
    "outputDir": "./my_docs",
    "excludePaths": ["/blog", "/forum"],
    "maxPagesPerSecond": 5,
    "maxDepth": 4,
    "numWorkers": 3
  }
}
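To make excludePaths and maxDepth a little more concrete, here is one way a crawler could apply them to a candidate link. Everything in it is an assumption made for illustration: the `shouldCrawl` function is hypothetical, excluded paths are treated as path prefixes, depth is counted in link hops from the starting URL, and links to other origins are skipped. Web-to-MD's real matching rules may differ.

```typescript
// Assumptions (not confirmed by the tool): prefix matching on the URL pathname,
// depth measured in link hops from the start URL, same-origin links only.
function shouldCrawl(
  candidate: string,
  depth: number,
  settings: { url: string; excludePaths: string[]; maxDepth: number },
): boolean {
  const start = new URL(settings.url);
  const target = new URL(candidate, start); // resolves relative links

  if (target.origin !== start.origin) return false; // stay on the same site
  if (depth > settings.maxDepth) return false; // too many hops from the start
  return !settings.excludePaths.some(
    (prefix) =>
      target.pathname === prefix || target.pathname.startsWith(prefix + "/"),
  );
}

const example = {
  url: "https://docs.example.com",
  excludePaths: ["/blog", "/forum"],
  maxDepth: 4,
};

console.log(shouldCrawl("/guide/intro", 2, example)); // true
console.log(shouldCrawl("/blog/2024/post", 1, example)); // false
console.log(shouldCrawl("/guide/deep", 5, example)); // false
```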
Start an initial crawl:
npm start -- -u https://docs.example.com -o ./my_docs -d 3 -w 2
If the crawl is interrupted, Web-to-MD will save the state in ./my_docs/config.json.
To resume, simply run:
npm start -- -o ./my_docs
To customize, edit ./my_docs/config.json to change the crawl settings as needed. For example:
{
  "settings": {
    "url": "https://docs.example.com",
    "outputDir": "./my_docs",
    "excludePaths": ["/blog", "/forum"],
    "maxPagesPerSecond": 5,
    "maxDepth": 4,
    "numWorkers": 3
  }
}
Then run the crawl again with the updated settings:
npm start -- -o ./my_docs
This workflow allows you to fine-tune your crawls and easily pick up where you left off!
Got ideas? Found a bug? We're all ears! Open an issue or send a pull request. Let's make Web-to-MD even more awesome together! 🤝
ISC (It's So Cool) License
A big thank you to all the open-source projects that made Web-to-MD possible. You rock! 🎸
Now go forth and crawl some docs! 🕷️📚