
AI-powered web scraping with modern CSS support. Extract content from any website using GPT-4; it handles CSS Grid/Flexbox layouts, Tailwind CSS, and complex selectors automatically.
AI-powered universal article extractor that automatically detects and extracts article patterns from any website using OpenAI's GPT models.
git clone <repository-url>
cd html2rss-ai
cp .env.example .env
# Edit .env and set your OPENAI_API_KEY
# Create config/batch_config.json with your URLs
{
"urls": [
{
"url": "https://example.com/blog",
"output_dir": "data/output/example",
"force_regenerate": false
}
]
}
# Process all URLs in batch_config.json
docker compose run --rm html2rss-ai
# Use custom configuration file
docker compose run --rm html2rss-ai /app/config/my_config.json
JSON output is written to ./data/output/ and cached patterns are stored in ./pattern_cache/.
pip install html2rss-ai
Create a configuration file (config.json):
{
"urls": [
{
"url": "https://example.com/blog",
"output_dir": "output",
"force_regenerate": false
}
]
}
export OPENAI_API_KEY="your-api-key"
python -m html2rss_ai.batch_processor config.json
# Create configuration with Paul Graham's essays
echo '{
"urls": [
{
"url": "https://www.paulgraham.com/articles.html",
"output_dir": "data/output/paulgraham",
"force_regenerate": false
}
]
}' > config/paulgraham.json
# Process with Docker
docker compose run --rm html2rss-ai /app/config/paulgraham.json
Option 1: JSON-based Batch Processing (Recommended for multiple URLs)
Create a configuration file (config/batch_config.json):
{
"urls": [
{
"url": "https://www.paulgraham.com/articles.html",
"output_dir": "data/output",
"force_regenerate": false
},
{
"url": "https://news.ycombinator.com",
"output_dir": "data/output/hn",
"force_regenerate": true
}
]
}
# Build and run the batch processor
docker compose build html2rss-ai
docker compose run --rm html2rss-ai
# With custom configuration
docker compose run --rm html2rss-ai /app/config/my_config.json
# With error handling options
docker compose run --rm html2rss-ai /app/config/batch_config.json --continue-on-error
Complete Batch Processing Guide - Detailed documentation with all configuration options.
All settings are configured through the JSON configuration file:
{
"urls": [
{
"url": "https://example.com/blog",
"output_dir": "data/output/custom",
"pattern_cache_dir": "pattern_cache/custom",
"force_regenerate": false,
"save_output": true
}
]
}
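When the list of sites grows, the configuration file can be generated rather than written by hand. The following is a minimal sketch, not part of html2rss-ai: the helper names `build_batch_config` and `write_batch_config` are hypothetical, and the per-site directory naming (host name as subdirectory) is an assumption. Only the five keys shown above come from the documented format.

```python
import json
from pathlib import Path
from urllib.parse import urlparse


def build_batch_config(urls, output_root="data/output", cache_root="pattern_cache"):
    """Return a config dict with one entry per URL, using the documented keys."""
    entries = []
    for url in urls:
        # Assumption: name each site's subdirectory after its host name.
        site = urlparse(url).netloc.replace(":", "_") or "site"
        entries.append({
            "url": url,
            "output_dir": f"{output_root}/{site}",
            "pattern_cache_dir": f"{cache_root}/{site}",
            "force_regenerate": False,
            "save_output": True,
        })
    return {"urls": entries}


def write_batch_config(config, path):
    """Write the config as indented JSON, creating parent directories if needed."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(config, indent=2) + "\n", encoding="utf-8")
    return path
```

For example, `write_batch_config(build_batch_config(["https://example.com/blog"]), "config/batch_config.json")` would produce a file that can then be passed to the batch processor.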
For processing multiple URLs, create a JSON configuration file:
{
"urls": [
{
"url": "https://example.com/blog",
"output_dir": "data/output/example",
"pattern_cache_dir": "pattern_cache/example",
"force_regenerate": false,
"save_output": true
}
]
}
See docs/BATCH_PROCESSING.md for complete configuration options.
| Variable | Default | Description |
|---|---|---|
| OPENAI_API_KEY | (required) | Your OpenAI API key |
| OUTPUT_DIR | data/output | Directory for JSON output files |
| PATTERN_CACHE_DIR | pattern_cache | Directory for cached patterns |
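A wrapper script that reads the same variables might resolve them as sketched below. This is an illustration of the defaults in the table, not the package's own code; the `Settings` class and `load_settings` function are hypothetical names.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    openai_api_key: str
    output_dir: str
    pattern_cache_dir: str


def load_settings(env=None):
    """Resolve the three variables from a mapping (defaults to os.environ)."""
    env = os.environ if env is None else env
    api_key = env.get("OPENAI_API_KEY", "").strip()
    if not api_key:
        # The table marks this variable as required, so fail early.
        raise RuntimeError("OPENAI_API_KEY is required but not set")
    return Settings(
        openai_api_key=api_key,
        output_dir=env.get("OUTPUT_DIR", "data/output"),
        pattern_cache_dir=env.get("PATTERN_CACHE_DIR", "pattern_cache"),
    )
```

Passing a plain dict instead of relying on `os.environ` keeps the function easy to test without touching the real environment.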
# See all available options
docker compose run --rm html2rss-ai --help
# Main arguments:
config_file Path to JSON configuration file (required)
--continue-on-error Continue processing even if some URLs fail
--log-level LEVEL Set logging level (DEBUG, INFO, WARNING, ERROR)
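For readers wrapping the tool in their own scripts, the documented arguments map onto a standard `argparse` definition like the one below. This is a sketch that mirrors the help text above, not the package's actual parser; the `INFO` default for `--log-level` is an assumption, since the source does not state one.

```python
import argparse


def build_parser():
    """Build a parser that mirrors the documented command-line arguments."""
    parser = argparse.ArgumentParser(prog="html2rss-ai")
    parser.add_argument("config_file", help="Path to JSON configuration file")
    parser.add_argument(
        "--continue-on-error",
        action="store_true",
        help="Continue processing even if some URLs fail",
    )
    parser.add_argument(
        "--log-level",
        default="INFO",  # assumed default; not stated in the documentation
        choices=["DEBUG", "INFO", "WARNING", "ERROR"],
        help="Set logging level",
    )
    return parser
```

Calling `build_parser().parse_args(["config.json", "--continue-on-error"])` yields a namespace with `continue_on_error` set to `True`.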
The Docker setup uses:
Host directories: ./data/output/ and ./pattern_cache/
Container directories: /app/data/output/ and /app/pattern_cache/
{
"links": [
{
"url": "https://example.com/article-1",
"title": "Article Title",
"selector_used": "h2 > a"
}
],
"total_found": 42,
"pattern_used": "articles",
"confidence": 0.95,
"base_url": "https://example.com/blog",
"pattern_analysis": {
"pattern_type": "articles",
"primary_selectors": ["h2 > a"],
"confidence_score": 0.95
}
}
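Downstream code can consume this output with the standard `json` module. The sketch below assumes the field names shown in the sample above; the function name `extract_links` and the `min_confidence` threshold are hypothetical and only illustrate one way to filter low-confidence results.

```python
import json


def extract_links(result, min_confidence=0.5):
    """Return (title, url) pairs, or an empty list if confidence is too low."""
    if result.get("confidence", 0.0) < min_confidence:
        return []
    return [
        (link.get("title", ""), link["url"])
        for link in result.get("links", [])
        if link.get("url")  # skip entries without a usable URL
    ]


def load_result(path):
    """Load one JSON output file from disk."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)
```

A typical call would be `extract_links(load_result(path), min_confidence=0.8)`, where `path` points at one of the files in the output directory.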
# Build with Docker Compose (creates html2rss-ai:latest)
docker compose build
# Or build directly with custom tag
docker build -t html2rss-ai:v1.0 .
pip install -e ".[playwright]"
playwright install chromium
pytest tests/
MIT License - see LICENSE file.
FAQs
We found that html2rss-ai demonstrated a healthy version release cadence and project activity because the last version was released less than a year ago. It has 1 open source maintainer collaborating on the project.