@jldb/web-to-md


A CLI tool to crawl (for example) documentation websites and convert them to Markdown.

Version 0.1.0 (latest)

🕷️ Web-to-MD: Your Friendly Neighborhood Web Crawler and Markdown Converter 🕸️

Welcome to Web-to-MD, the CLI tool that turns websites into your personal Markdown library! 🚀

🌟 Why Web-to-MD?

Ever wished you could magically transform entire websites into neatly organized Markdown files? Well, wish no more! Web-to-MD is here to save the day (and your sanity)!

🎭 Features That'll Make You Go "Wow!"

  • 🔍 Crawls websites like a pro detective
  • 🧙‍♂️ Magically transforms HTML into beautiful Markdown
  • 🏃‍♂️ Resumes interrupted crawls (because life happens!)
  • 📚 Creates separate Markdown files or one big book of knowledge
  • 🎨 Shows fancy progress bars (because who doesn't love those?)
  • 🚦 Respects rate limits (we're polite crawlers here!)
  • 🌳 Preserves directory structure (if you're into that sort of thing)
  • 🔒 Handles authentication gracefully (no trespassing allowed!)
  • 👥 Multi-worker support (because teamwork makes the dream work!)
  • 🔄 Smart content change detection (no need to crawl what hasn't changed!)
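This README does not say how the content change detection works. One common approach is to store a hash of each page's content and skip pages whose hash has not changed. The sketch below assumes that approach; `HashStore`, `fnv1a`, and `hasChanged` are hypothetical names, not part of Web-to-MD, and the hash is a simple non-cryptographic FNV-1a variant chosen only to keep the example dependency-free.

```typescript
// Hypothetical store: maps a page URL to the hash of its last-seen content.
type HashStore = Map<string, string>;

// 32-bit FNV-1a over UTF-16 code units (an assumption for illustration;
// a real crawler might prefer a cryptographic hash such as SHA-256).
function fnv1a(text: string): string {
  let hash = 0x811c9dc5; // FNV offset basis
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0; // multiply by FNV prime, keep unsigned
  }
  return hash.toString(16);
}

// Returns true when the page is new or its content differs from the last
// recorded version, and records the new hash in the store.
function hasChanged(store: HashStore, url: string, content: string): boolean {
  const next = fnv1a(content);
  if (store.get(url) === next) {
    return false; // same content as last time, nothing to re-convert
  }
  store.set(url, next);
  return true;
}
```

A crawler built this way would call `hasChanged` after fetching a page and only rewrite the Markdown file when it returns `true`.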

🛠️ Installation

  1. Clone this repo (it won't bite, promise!)
  2. Run npm install (sit back and watch the magic happen)
  3. Run npm run build to compile the TypeScript code

🚀 Usage

Fire up Web-to-MD with this incantation:

npm start -- -u <url> -o <output_directory> [options]

🎛️ Options (Mix and Match to Your Heart's Content)

  • -u, --url <url>: The URL of your web treasure trove (required)
  • -o, --output <output>: Where to stash your Markdown gold (required)
  • -c, --combine: Merge all pages into one massive scroll of knowledge
  • -e, --exclude <paths>: Comma-separated list of paths to skip (shh, we won't tell)
  • -r, --rate <rate>: Max pages per second (default: 5, for the speed demons)
  • -d, --depth <depth>: How deep should we dig? (default: 3, watch out for dragons)
  • -m, --max-file-size <size>: Max file size in MB for combined output (default: 2)
  • -n, --name <name>: Name your combined file (get creative!)
  • -p, --preserve-structure: Keep the directory structure (for the neat freaks)
  • -t, --timeout <timeout>: Timeout in seconds for page navigation (default: 3.5)
  • -i, --initial-timeout <initialTimeout>: Initial timeout in seconds for the first page load (default: 60)
  • -re, --retries <retries>: Number of retries for initial page load (default: 3)
  • -w, --workers <workers>: Number of concurrent workers (default: 1, for the multitaskers)

🌟 Example (Because We All Need a Little Guidance)

npm start -- -u https://docs.example.com -o ./my_docs -c -d 5 -r 3 -n "ExampleDocs" -w 3

This will:

  1. Crawl https://docs.example.com
  2. Save Markdown files to ./my_docs
  3. Combine all pages into one file
  4. Crawl up to 5 levels deep
  5. Respect a rate limit of 3 pages per second
  6. Name the combined file "ExampleDocs"
  7. Use 3 concurrent workers for faster crawling

🔧 Config Magic: Resuming and Customizing Your Crawls

Web-to-MD comes with a nifty config feature that lets you resume interrupted crawls and customize your crawling experience. Here's how it works:

📁 Config File

After a crawl (complete or interrupted), Web-to-MD saves a config.json file in your output directory. This file contains all the settings and state information from your last crawl.

🔄 Resuming a Crawl

To resume an interrupted crawl, simply run Web-to-MD with the same output directory. The tool will automatically detect the config.json file and pick up where it left off.
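The resume decision can be sketched roughly as follows. This README only confirms that config.json holds a top-level "settings" object plus some state information; the `state` fields (`visited`, `pending`) and the names `SavedCrawl`, `CrawlPlan`, and `planCrawl` are assumptions made for illustration. The function takes the file's text (or `null` when no config.json exists) so that the logic stays independent of file I/O.

```typescript
// Assumed shape of config.json. Only "settings" is confirmed by the README;
// the "state" fields are hypothetical.
interface SavedCrawl {
  settings: { url: string; outputDir: string };
  state?: { visited: string[]; pending: string[] };
}

interface CrawlPlan {
  resumed: boolean;
  startUrls: string[];
  alreadyVisited: Set<string>;
}

// Decides whether to resume from a saved config or start a fresh crawl.
function planCrawl(savedJson: string | null, cliUrl?: string): CrawlPlan {
  if (savedJson !== null) {
    const saved = JSON.parse(savedJson) as SavedCrawl;
    const visited = new Set<string>(saved.state?.visited ?? []);
    const pending = (saved.state?.pending ?? []).filter((u) => !visited.has(u));
    // With no pending URLs recorded, fall back to the saved starting URL.
    const startUrls = pending.length > 0 ? pending : [saved.settings.url];
    return { resumed: true, startUrls, alreadyVisited: visited };
  }
  if (!cliUrl) {
    throw new Error("No saved config found and no URL given");
  }
  return { resumed: false, startUrls: [cliUrl], alreadyVisited: new Set<string>() };
}
```

In this sketch, a caller would read config.json from the output directory if it exists, pass its contents to `planCrawl`, and then crawl only the returned `startUrls`.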

🎛️ Customizing Your Crawl

You can manually edit the config.json file to customize your next crawl. Here are the available options and their default values:

| Option | Description | Default Value |
| --- | --- | --- |
| url | Starting URL for the crawl | (Required) |
| outputDir | Output directory for Markdown files | (Required) |
| excludePaths | Paths to exclude from crawling | [] |
| maxPagesPerSecond | Maximum pages to crawl per second | 5 |
| maxDepth | Maximum depth to crawl | 3 |
| maxFileSizeMB | Maximum file size in MB for combined output | 2 |
| combine | Combine all pages into a single file | false |
| name | Name for the combined output file | undefined |
| preserveStructure | Preserve directory structure | false |
| timeout | Timeout in seconds for page navigation | 3.5 |
| initialTimeout | Initial timeout in seconds for the first page load | 60 |
| retries | Number of retries for initial page load | 3 |
| numWorkers | Number of concurrent workers | 1 |
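The options and defaults listed above could be modelled in TypeScript along these lines. The field names and default values come from the table; the `CrawlSettings` type, the `DEFAULTS` constant, and the `withDefaults` helper are hypothetical and may not match the tool's own source.

```typescript
// Settings as listed in the table above (type name is an assumption).
interface CrawlSettings {
  url: string;
  outputDir: string;
  excludePaths: string[];
  maxPagesPerSecond: number;
  maxDepth: number;
  maxFileSizeMB: number;
  combine: boolean;
  name?: string;
  preserveStructure: boolean;
  timeout: number; // seconds
  initialTimeout: number; // seconds
  retries: number;
  numWorkers: number;
}

type RequiredKeys = "url" | "outputDir";

// Default values copied from the table; url and outputDir have no default.
const DEFAULTS: Omit<CrawlSettings, RequiredKeys> = {
  excludePaths: [],
  maxPagesPerSecond: 5,
  maxDepth: 3,
  maxFileSizeMB: 2,
  combine: false,
  name: undefined,
  preserveStructure: false,
  timeout: 3.5,
  initialTimeout: 60,
  retries: 3,
  numWorkers: 1,
};

// Fills in defaults for any setting the caller (or config.json) omits.
function withDefaults(
  input: Pick<CrawlSettings, RequiredKeys> & Partial<CrawlSettings>
): CrawlSettings {
  return {
    ...DEFAULTS,
    ...input,
    // Copy the array so callers never share the default instance.
    excludePaths: [...(input.excludePaths ?? DEFAULTS.excludePaths)],
  };
}
```

For example, `withDefaults({ url: "https://docs.example.com", outputDir: "./my_docs", maxDepth: 4 })` keeps `maxDepth` at 4 and takes every other value from the table.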

For example, the following config.json skips /blog and /forum, crawls up to 4 levels deep, and uses 3 workers:

{
  "settings": {
    "url": "https://docs.example.com",
    "outputDir": "./my_docs",
    "excludePaths": ["/blog", "/forum"],
    "maxPagesPerSecond": 5,
    "maxDepth": 4,
    "numWorkers": 3
  }
}
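As a rough illustration of how settings like `excludePaths` and `maxDepth` might gate which links get crawled, here is a minimal sketch. It assumes that excluded paths are matched as path prefixes on segment boundaries and that depth counts link hops from the starting URL; the README does not specify either, and `QueuedLink`, `isExcluded`, and `shouldVisit` are hypothetical names.

```typescript
// Hypothetical queue entry: a link's URL path and its distance from the start URL.
interface QueuedLink {
  pathname: string;
  depth: number;
}

interface FilterSettings {
  excludePaths: string[];
  maxDepth: number;
}

// True when the path equals an excluded path or lies beneath it.
// Matching on "/" boundaries means "/blog" excludes "/blog/post" but not "/blogroll".
function isExcluded(pathname: string, excludePaths: string[]): boolean {
  return excludePaths.some((prefix) => {
    const base = prefix.endsWith("/") ? prefix.slice(0, -1) : prefix;
    return pathname === base || pathname.startsWith(base + "/");
  });
}

// A link is crawled only if it is within the depth limit and not excluded.
function shouldVisit(link: QueuedLink, settings: FilterSettings): boolean {
  return (
    link.depth <= settings.maxDepth &&
    !isExcluded(link.pathname, settings.excludePaths)
  );
}
```

With the settings from the example config, a link to /guide/intro at depth 2 would be visited, while /blog/post-1 or any link at depth 5 would be skipped.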

🌟 Example Workflow

  1. Start an initial crawl:

    npm start -- -u https://docs.example.com -o ./my_docs -d 3 -w 2
    
  2. If the crawl is interrupted, Web-to-MD will save the state in ./my_docs/config.json.

  3. To resume, simply run:

    npm start -- -o ./my_docs
    
  4. To customize, edit ./my_docs/config.json to change the crawl settings as needed. For example:

    {
      "settings": {
        "url": "https://docs.example.com",
        "outputDir": "./my_docs",
        "excludePaths": ["/blog", "/forum"],
        "maxPagesPerSecond": 5,
        "maxDepth": 4,
        "numWorkers": 3
      }
    }

  5. Run the crawl again with the updated config:

    npm start -- -o ./my_docs
    

This workflow allows you to fine-tune your crawls and easily pick up where you left off!

🎭 Contributing

Got ideas? Found a bug? We're all ears! Open an issue or send a pull request. Let's make Web-to-MD even more awesome together! 🤝

📜 License

ISC (It's So Cool) License

🙏 Acknowledgements

A big thank you to all the open-source projects that made Web-to-MD possible. You rock! 🎸

Now go forth and crawl some docs! 🕷️📚


Package last updated on 26 Jun 2024
