@jldb/web-to-md
A CLI tool to crawl (for example) documentation websites and convert them to Markdown.
Welcome to Web-to-MD, the CLI tool that turns websites into your personal Markdown library! 🚀
Ever wished you could magically transform entire websites into neatly organized Markdown files? Well, wish no more! Web-to-MD is here to save the day (and your sanity)!
- npm install (sit back and watch the magic happen)
- npm run build to compile the TypeScript code

Fire up Web-to-MD with this incantation:
npm start -- -u <url> -o <output_directory> [options]
- -u, --url <url>: The URL of your web treasure trove (required)
- -o, --output <output>: Where to stash your Markdown gold (required)
- -c, --combine: Merge all pages into one massive scroll of knowledge
- -e, --exclude <paths>: Comma-separated list of paths to skip (shh, we won't tell)
- -r, --rate <rate>: Max pages per second (default: 5, for the speed demons)
- -d, --depth <depth>: How deep should we dig? (default: 3, watch out for dragons)
- -m, --max-file-size <size>: Max file size in MB for combined output (default: 2)
- -n, --name <name>: Name your combined file (get creative!)
- -p, --preserve-structure: Keep the directory structure (for the neat freaks)
- -t, --timeout <timeout>: Timeout in seconds for page navigation (default: 3.5)
- -i, --initial-timeout <initialTimeout>: Initial timeout in seconds for the first page (default: 60)
- -re, --retries <retries>: Number of retries for initial page load (default: 3)
- -w, --workers <workers>: Number of concurrent workers (default: 1, for the multitaskers)

npm start -- -u https://docs.example.com -o ./my_docs -c -d 5 -r 3 -n "ExampleDocs" -w 3
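This will:
- Crawl https://docs.example.com and save the output to ./my_docs
- Combine all pages into a single file named "ExampleDocs"
- Follow links up to a depth of 5
- Limit the crawl to 3 pages per second
- Use 3 concurrent workers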
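The -r 3 flag in the example caps the crawl at 3 pages per second. How Web-to-MD enforces this internally is not described here; what follows is a minimal TypeScript sketch of one common approach, spacing request starts at least 1000 / rate milliseconds apart. The names minIntervalMs and rateLimit are hypothetical and not part of the tool.

```typescript
// Hypothetical helper, not taken from Web-to-MD's source.
// Converts a "pages per second" cap into the minimum gap between requests.
function minIntervalMs(maxPagesPerSecond: number): number {
  if (!(maxPagesPerSecond > 0)) {
    throw new RangeError("maxPagesPerSecond must be a positive number");
  }
  return 1000 / maxPagesPerSecond;
}

// Wraps an async task so that successive calls start at least
// minIntervalMs(rate) milliseconds apart (an assumed strategy).
function rateLimit<T, R>(
  task: (input: T) => Promise<R>,
  maxPagesPerSecond: number,
): (input: T) => Promise<R> {
  const interval = minIntervalMs(maxPagesPerSecond);
  let nextStart = 0; // earliest timestamp (ms) at which the next call may start

  return async (input: T): Promise<R> => {
    const now = Date.now();
    const startAt = Math.max(now, nextStart);
    nextStart = startAt + interval; // reserve the slot before waiting
    const wait = startAt - now;
    if (wait > 0) {
      await new Promise<void>((resolve) => setTimeout(resolve, wait));
    }
    return task(input);
  };
}
```

With a cap of 5 pages per second (the default), this sketch would start a new page load no more often than every 200 ms, regardless of how many workers share the wrapped function.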
Web-to-MD comes with a nifty config feature that lets you resume interrupted crawls and customize your crawling experience. Here's how it works:
After a crawl (complete or interrupted), Web-to-MD saves a config.json file in your output directory. This file contains all the settings and state information from your last crawl.
To resume an interrupted crawl, simply run Web-to-MD with the same output directory. The tool will automatically detect the config.json file and pick up where it left off.
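The detection step might look roughly like the sketch below. It is an illustration only: loadSavedConfig and SavedConfig are hypothetical names, and because this README only shows the "settings" key, the shape of any saved crawl state is left untyped.

```typescript
import * as fs from "node:fs";
import * as path from "node:path";

// Only "settings" appears in this README's examples; any other keys
// (such as saved crawl state) are undocumented here and kept as unknown.
interface SavedConfig {
  settings: Record<string, unknown>;
  [key: string]: unknown;
}

// Hypothetical helper: returns the saved config from a previous crawl,
// or null when the output directory has no config.json to resume from.
function loadSavedConfig(outputDir: string): SavedConfig | null {
  const configPath = path.join(outputDir, "config.json");
  if (!fs.existsSync(configPath)) {
    return null;
  }
  const parsed: unknown = JSON.parse(fs.readFileSync(configPath, "utf8"));
  const settings =
    typeof parsed === "object" && parsed !== null
      ? (parsed as { settings?: unknown }).settings
      : undefined;
  if (typeof settings !== "object" || settings === null) {
    throw new Error(`${configPath} does not contain a "settings" object`);
  }
  return parsed as SavedConfig;
}
```

A caller could treat a null result as "start a fresh crawl" and anything else as "resume with these settings".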
You can manually edit the config.json file to customize your next crawl. Here are the available options and their default values:
Option | Description | Default Value |
---|---|---|
url | Starting URL for the crawl | (Required) |
outputDir | Output directory for Markdown files | (Required) |
excludePaths | Paths to exclude from crawling | [] |
maxPagesPerSecond | Maximum pages to crawl per second | 5 |
maxDepth | Maximum depth to crawl | 3 |
maxFileSizeMB | Maximum file size in MB for combined output | 2 |
combine | Combine all pages into a single file | false |
name | Name for the combined output file | undefined |
preserveStructure | Preserve directory structure | false |
timeout | Timeout in seconds for page navigation | 3.5 |
initialTimeout | Initial timeout in seconds for the first page load | 60 |
retries | Number of retries for initial page load | 3 |
numWorkers | Number of concurrent workers | 1 |
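For readers who prefer types to tables, the options above can be written down as a TypeScript interface with the listed defaults. This is a sketch transcribed from the table, not the tool's actual source; CrawlSettings, DEFAULTS, and applyDefaults are hypothetical names.

```typescript
// Shape of the "settings" object, transcribed from the table above.
interface CrawlSettings {
  url: string; // required
  outputDir: string; // required
  excludePaths: string[];
  maxPagesPerSecond: number;
  maxDepth: number;
  maxFileSizeMB: number;
  combine: boolean;
  name?: string; // no default (undefined)
  preserveStructure: boolean;
  timeout: number; // seconds
  initialTimeout: number; // seconds
  retries: number;
  numWorkers: number;
}

type RequiredKeys = "url" | "outputDir";

// Default values exactly as listed in the table.
const DEFAULTS: Omit<CrawlSettings, RequiredKeys | "name"> = {
  excludePaths: [],
  maxPagesPerSecond: 5,
  maxDepth: 3,
  maxFileSizeMB: 2,
  combine: false,
  preserveStructure: false,
  timeout: 3.5,
  initialTimeout: 60,
  retries: 3,
  numWorkers: 1,
};

// Hypothetical helper: fills in any option the caller left out.
function applyDefaults(
  partial: Pick<CrawlSettings, RequiredKeys> & Partial<CrawlSettings>,
): CrawlSettings {
  return {
    ...DEFAULTS,
    ...partial,
    // Copy the array so callers never share the DEFAULTS instance.
    excludePaths: [...(partial.excludePaths ?? DEFAULTS.excludePaths)],
  };
}
```

Calling applyDefaults with only url and outputDir would yield the same settings as running the CLI with just -u and -o, assuming the CLI defaults match the table.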
You can modify these settings in the config.json file to customize your crawl. For example:
{
  "settings": {
    "url": "https://docs.example.com",
    "outputDir": "./my_docs",
    "excludePaths": ["/blog", "/forum"],
    "maxPagesPerSecond": 5,
    "maxDepth": 4,
    "numWorkers": 3
  }
}
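To make the effect of excludePaths and maxDepth concrete, here is a minimal TypeScript sketch of a link filter. The matching rules are assumptions, not documented behavior: it treats each excluded path as a path prefix, counts the starting page as depth 0, and only follows links on the same origin as the starting URL. shouldCrawl and FilterSettings are hypothetical names.

```typescript
// Minimal subset of the settings used by this sketch.
interface FilterSettings {
  url: string;
  excludePaths: string[];
  maxDepth: number;
}

// Parses a URL (optionally relative to a base); returns null if invalid.
function parseUrl(value: string, base?: string): URL | null {
  try {
    return new URL(value, base);
  } catch {
    return null;
  }
}

// Hypothetical predicate: should a link found at `depth` be crawled?
function shouldCrawl(
  candidate: string,
  depth: number,
  settings: FilterSettings,
): boolean {
  if (depth > settings.maxDepth) return false;

  const start = parseUrl(settings.url);
  if (start === null) return false;
  const target = parseUrl(candidate, start.href);
  if (target === null) return false;

  // Assumption: stay on the same site as the starting URL.
  if (target.origin !== start.origin) return false;

  // Assumption: "/blog" excludes "/blog" and "/blog/...", but not "/blogroll".
  return !settings.excludePaths.some((prefix) => {
    const withSlash = prefix.endsWith("/") ? prefix : `${prefix}/`;
    return target.pathname === prefix || target.pathname.startsWith(withSlash);
  });
}
```

Under these assumptions, the example config above would skip https://docs.example.com/blog/post-1 and any page more than 4 links away from the start, while still crawling https://docs.example.com/guide/intro.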
Start an initial crawl:
npm start -- -u https://docs.example.com -o ./my_docs -d 3 -w 2
If the crawl is interrupted, Web-to-MD will save the state in ./my_docs/config.json.
To resume, simply run:
npm start -- -o ./my_docs
To customize, edit ./my_docs/config.json to change the crawl settings as needed. For example:
{
  "settings": {
    "url": "https://docs.example.com",
    "outputDir": "./my_docs",
    "excludePaths": ["/blog", "/forum"],
    "maxPagesPerSecond": 5,
    "maxDepth": 4,
    "numWorkers": 3
  }
}
Then run the crawl again with the updated settings:
npm start -- -o ./my_docs
This workflow allows you to fine-tune your crawls and easily pick up where you left off!
Got ideas? Found a bug? We're all ears! Open an issue or send a pull request. Let's make Web-to-MD even more awesome together! 🤝
ISC (It's So Cool) License
A big thank you to all the open-source projects that made Web-to-MD possible. You rock! 🎸
Now go forth and crawl some docs! 🕷️📚