
Web Crawler
A performant, extensible, and lean web crawler that uses all available CPUs by default.
It uses an event loop for I/O and separate processes for analyzing pages.
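The split between an event loop for I/O and worker processes for analysis can be sketched roughly as follows. This is a minimal illustration using only the standard library, not the crawler's actual code: the functions fetch, analyze, and crawl are hypothetical names, and the download is simulated so the example runs offline.

```python
import asyncio
import os
from concurrent import futures


def analyze(html: str) -> int:
    """CPU-bound stand-in (hypothetical): count opening anchor tags."""
    return html.count("<a ")


async def fetch(url: str) -> str:
    """I/O-bound stand-in (hypothetical): simulates a download, no network."""
    await asyncio.sleep(0.01)
    return f'<html><a href="{url}/next">next</a></html>'


async def crawl(urls: list[str], executor: futures.Executor) -> list[int]:
    loop = asyncio.get_running_loop()
    # Downloads run concurrently on the event loop.
    pages = await asyncio.gather(*(fetch(url) for url in urls))
    # Analysis is handed to the executor so it does not block the loop.
    jobs = [loop.run_in_executor(executor, analyze, page) for page in pages]
    return await asyncio.gather(*jobs)


if __name__ == "__main__":
    # One worker process per CPU, mirroring the "all available CPUs" default.
    with futures.ProcessPoolExecutor(max_workers=os.cpu_count()) as pool:
        print(asyncio.run(crawl(["https://example.com/a", "https://example.com/b"], pool)))
```

Passing the executor in as a parameter keeps the sketch easy to try with a thread pool or a smaller process pool.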
Batteries included
- Basic httpx page downloader
- S3 page storage
- Local filesystem page storage
Usage
- Have a look at tests/integration/test_crawl.py
- Implement your own PageAnalyzer and PageDownloader classes
- Optionally customize structlog logging, see configuration
- Have fun!
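As a rough idea of what implementing your own PageAnalyzer and PageDownloader might look like, here is a self-contained sketch. The base classes and their method names (download, analyze) are assumptions made for illustration only; the real interfaces live in the project and should be checked in tests/integration/test_crawl.py. The downloader serves pages from a dict so the example runs without network access.

```python
import asyncio
from abc import ABC, abstractmethod
from html.parser import HTMLParser
from urllib.parse import urljoin


class PageDownloader(ABC):
    """Hypothetical stand-in for the project's downloader base class."""

    @abstractmethod
    async def download(self, url: str) -> str: ...


class PageAnalyzer(ABC):
    """Hypothetical stand-in for the project's analyzer base class."""

    @abstractmethod
    def analyze(self, url: str, html: str) -> list[str]: ...


class StaticDownloader(PageDownloader):
    """Serves pages from an in-memory dict instead of the network."""

    def __init__(self, pages: dict[str, str]) -> None:
        self.pages = pages

    async def download(self, url: str) -> str:
        return self.pages.get(url, "")


class _LinkParser(HTMLParser):
    """Collects href values from anchor tags."""

    def __init__(self) -> None:
        super().__init__()
        self.hrefs: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.hrefs.extend(v for k, v in attrs if k == "href" and v)


class LinkAnalyzer(PageAnalyzer):
    """Returns absolute URLs for every link found on the page."""

    def analyze(self, url: str, html: str) -> list[str]:
        parser = _LinkParser()
        parser.feed(html)
        return [urljoin(url, href) for href in parser.hrefs]


if __name__ == "__main__":
    pages = {"https://example.com/": '<a href="/about">About</a>'}
    html = asyncio.run(StaticDownloader(pages).download("https://example.com/"))
    print(LinkAnalyzer().analyze("https://example.com/", html))
```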
Customization
All classes in the modules folder can be replaced with your custom implementation.
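One way to picture such a replacement is a component that depends only on an interface, so any conforming class can be passed in. The sketch below uses a page storage as the example; PageStorage, its save method, DirectoryStorage, and store_page are all hypothetical names chosen for illustration and may not match the classes in the modules folder.

```python
import hashlib
import tempfile
from pathlib import Path
from typing import Protocol


class PageStorage(Protocol):
    """Hypothetical storage interface; the project's real one may differ."""

    def save(self, url: str, content: bytes) -> str: ...


class DirectoryStorage:
    """Custom storage that writes each page to a file under a root directory."""

    def __init__(self, root: Path) -> None:
        self.root = root
        self.root.mkdir(parents=True, exist_ok=True)

    def save(self, url: str, content: bytes) -> str:
        # Hash the URL so any URL maps to a safe, fixed-length file name.
        name = hashlib.sha256(url.encode("utf-8")).hexdigest() + ".html"
        path = self.root / name
        path.write_bytes(content)
        return str(path)


def store_page(storage: PageStorage, url: str, content: bytes) -> str:
    """Depends only on the interface, so implementations are interchangeable."""
    return storage.save(url, content)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        print(store_page(DirectoryStorage(Path(tmp)), "https://example.com/", b"<html></html>"))
```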