fetch-url-package
Professional web content fetching and extraction toolkit with configurable extraction methods, detailed error handling, and domain caching.
Installation
Basic Installation
pip install fetch-url-package
With Trafilatura Support
pip install fetch-url-package[trafilatura]
Development Installation
pip install fetch-url-package[dev]
Quick Start
Basic fetch and extraction:
from fetch_url_package import fetch

result = fetch("https://example.com")
if result.success:
    print("Content:", result.content)
else:
    print(f"Error ({result.error_type}): {result.error_message}")
Using trafilatura extraction:
from fetch_url_package import fetch, FetchConfig, ExtractionMethod

config = FetchConfig(
    extraction_method=ExtractionMethod.TRAFILATURA,
    extraction_kwargs={"include_tables": True}
)
result = fetch("https://example.com", config=config)
if result.success:
    print(result.content)
Fetching raw HTML only:
from fetch_url_package import fetch_html

result = fetch_html("https://example.com")
if result.success:
    print("HTML:", result.html)
Using Content Cache
The content cache stores successfully fetched webpage content (HTML and extracted text) so that repeated requests for the same URL do not hit the network again. It is sharded internally to stay fast under high concurrency.
from fetch_url_package import fetch, FetchConfig, ContentCache

# Enable a 500-entry content cache
config = FetchConfig(content_cache_size=500)

# The first fetch hits the network; the second is served from the cache
result1 = fetch("https://example.com", config=config)
result2 = fetch("https://example.com", config=config)
print(result2.metadata.get("from_cache"))

# For high-concurrency workloads, create a cache with more shards
high_concurrency_cache = ContentCache(max_size=1000, num_shards=32)
config_high_perf = FetchConfig(content_cache=high_concurrency_cache)

# Inspect cache statistics
if config.content_cache:
    stats = config.content_cache.get_stats()
    print(f"Cached pages: {stats['total_entries']}/{stats['max_size']}")
    print(f"Shards: {stats['num_shards']}")
Advanced Configuration
from fetch_url_package import fetch, FetchConfig, ExtractionMethod, DomainCache, ContentCache

# Persistent cache of domains that keep failing
domain_cache = DomainCache(
    cache_file="/tmp/fetch_cache.json",
    ttl=86400,
    failure_threshold=3
)

# In-memory cache of successfully fetched pages
content_cache = ContentCache(max_size=1000)

config = FetchConfig(
    max_retries=5,
    retry_delay=2.0,
    timeout=60.0,
    connect_timeout=15.0,
    extraction_method=ExtractionMethod.SIMPLE,
    custom_headers={
        "X-Custom-Header": "value"
    },
    use_cache=True,
    cache=domain_cache,
    content_cache=content_cache,
    return_html=True,
    blocked_domains=["example-blocked.com"]
)
result = fetch("https://example.com", config=config)
API Reference
Main Functions
fetch(url, config=None, extract=True)
Fetch a URL and optionally extract its content.
Parameters:
url (str): URL to fetch
config (FetchConfig, optional): Configuration object
extract (bool): Whether to extract content (default: True)
Returns: FetchResult object
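As a quick illustration (assuming that extract=False skips the extraction step and leaves result.content unset):

from fetch_url_package import fetch

# Fetch the page without running content extraction (assumed behavior of extract=False)
result = fetch("https://example.com", extract=False)
print(result.success, result.status_code)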
fetch_html(url, config=None)
Fetch HTML content only without extraction.
Parameters:
url (str): URL to fetch
config (FetchConfig, optional): Configuration object
Returns: FetchResult object
Configuration Classes
FetchConfig
Configuration for fetch operations.
Parameters:
max_retries (int): Maximum retry attempts (default: 3)
retry_delay (float): Base delay between retries in seconds (default: 1.0)
timeout (float): Request timeout in seconds (default: 30.0)
connect_timeout (float): Connection timeout in seconds (default: 10.0)
follow_redirects (bool): Follow HTTP redirects (default: True)
max_redirects (int): Maximum number of redirects (default: 10)
http2 (bool): Use HTTP/2 (default: True)
verify_ssl (bool): Verify SSL certificates (default: False)
user_agents (List[str], optional): List of user agents to rotate
referers (List[str], optional): List of referers to rotate
custom_headers (Dict[str, str], optional): Custom HTTP headers
extraction_method (ExtractionMethod): Extraction method (default: SIMPLE)
extraction_kwargs (Dict): Additional arguments for extractor
filter_file_extensions (bool): Filter file URLs (default: True)
blocked_domains (List[str], optional): Domains to block
use_cache (bool): Use domain cache for failed domains (default: False)
cache (DomainCache, optional): Domain cache instance for failed domains
content_cache_size (int): Size of content cache for successful fetches (default: 0, disabled)
content_cache (ContentCache, optional): Content cache instance
return_html (bool): Include HTML in result (default: False)
DomainCache
Cache for tracking failed domains to avoid repeated failures.
Parameters:
cache_file (str, optional): Path to cache file for persistence
ttl (int): Time-to-live for cache entries in seconds (default: 86400)
failure_threshold (int): Failures before caching domain (default: 3)
max_size (int): Maximum cache entries (default: 10000)
Methods:
should_skip(url): Check if URL should be skipped
record_failure(url, error_type): Record a failure
record_success(url): Record a success
clear(): Clear all cache entries
get_stats(): Get cache statistics
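A short sketch of these methods under the documented semantics (failures are recorded per domain, and should_skip returns True once a domain passes failure_threshold within ttl); the "timeout" label below is illustrative:

from fetch_url_package import DomainCache

cache = DomainCache(failure_threshold=3, ttl=3600)

url = "https://flaky.example.com/page"
if not cache.should_skip(url):
    # ... attempt the fetch; if it fails, record the failure
    cache.record_failure(url, "timeout")  # error_type value is illustrative

# A later success clears the domain's failure state
cache.record_success(url)
print(cache.get_stats())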
ContentCache
High-performance LRU cache for storing successfully fetched webpage content (both HTML and extracted text).
Parameters:
max_size (int): Maximum number of entries to cache (default: 500)
num_shards (int): Number of cache shards for concurrency (default: 16)
Features:
- LRU (Least Recently Used) eviction policy
- Optimized for high concurrency with sharding to reduce lock contention
- Stores both HTML and extracted content
- Only caches successful fetches (not failures)
- Thread-safe operations with minimal blocking
- Short-term temporary cache for performance optimization
Concurrency Optimization:
The cache uses sharding to minimize lock contention under high concurrency. URLs are distributed across multiple shards based on hash, allowing concurrent operations on different URLs to proceed in parallel without blocking each other. This design supports thousands of concurrent requests efficiently.
Methods:
get(url): Get cached content for a URL
put(url, html, content, final_url, metadata): Store content in cache
clear(): Clear all cache entries
get_stats(): Get cache statistics
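A minimal illustration of this API, assuming put takes its arguments in the order listed above and get returns None on a miss:

from fetch_url_package import ContentCache

cache = ContentCache(max_size=100, num_shards=8)

# Store a fetched page (values are illustrative)
cache.put(
    "https://example.com",    # url
    "<html>...</html>",       # html
    "Example Domain",         # extracted content
    "https://example.com/",   # final_url
    {"status_code": 200},     # metadata
)

entry = cache.get("https://example.com")
if entry is not None:
    print("cache hit")

print(cache.get_stats())
cache.clear()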
Result Classes
FetchResult
Result object containing fetch outcome and data.
Attributes:
url (str): Original URL
success (bool): Whether fetch was successful
content (str, optional): Extracted content
html (str, optional): Raw HTML content
error_type (ErrorType, optional): Type of error if failed
error_message (str, optional): Error message if failed
status_code (int, optional): HTTP status code
final_url (str, optional): Final URL after redirects
metadata (Dict): Additional metadata
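For example, a result can be inspected like this:

from fetch_url_package import fetch

result = fetch("https://example.com")
if result.success:
    print(f"Final URL: {result.final_url}")
    print(f"Status: {result.status_code}")
    print(f"Content length: {len(result.content or '')}")
else:
    print(f"{result.error_type}: {result.error_message}")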
Extraction Methods
SIMPLE
Simple and fast extraction that removes HTML/XML tags without complex parsing.
Pros:
- No external dependencies
- Fast performance
- Reliable for most web pages
Cons:
- Less sophisticated than trafilatura
- May include some unwanted content
TRAFILATURA
Advanced extraction using the trafilatura library.
Pros:
- Better content extraction quality
- Filters out navigation, ads, etc.
- Handles complex page structures
Cons:
- Requires trafilatura dependency
- Slightly slower
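Switching between the two is a one-line configuration change; a small sketch that falls back to SIMPLE when trafilatura is not installed:

from fetch_url_package import fetch, FetchConfig, ExtractionMethod

try:
    import trafilatura  # noqa: F401  (optional dependency)
    method = ExtractionMethod.TRAFILATURA
except ImportError:
    method = ExtractionMethod.SIMPLE

config = FetchConfig(extraction_method=method)
result = fetch("https://example.com", config=config)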
Error Types
The package provides detailed error types:
NOT_FOUND (404): Page not found
FORBIDDEN (403): Access denied
RATE_LIMITED (429): Too many requests
SERVER_ERROR (5xx): Server error
TIMEOUT: Request timeout
NETWORK_ERROR: Network/connection error
SSL_ERROR: SSL/TLS error
FILTERED: URL filtered by configuration
EMPTY_CONTENT: Page returned empty content
EXTRACTION_FAILED: Content extraction failed
CACHED_FAILURE: Domain in failure cache
UNKNOWN: Unknown error
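A sketch of branching on the error type, assuming the ErrorType enum is importable from fetch_url_package alongside fetch:

from fetch_url_package import fetch, ErrorType  # ErrorType import assumed

result = fetch("https://example.com/missing")
if not result.success:
    if result.error_type == ErrorType.RATE_LIMITED:
        print("Back off and retry later")
    elif result.error_type in (ErrorType.TIMEOUT, ErrorType.NETWORK_ERROR):
        print("Transient problem; a retry may help")
    elif result.error_type == ErrorType.NOT_FOUND:
        print("Permanent failure; skip this URL")
    else:
        print(f"{result.error_type}: {result.error_message}")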
Best Practices & Recommendations
1. Bypassing Human Verification (CAPTCHA)
Challenge: Many websites use CAPTCHA or human verification to block automated requests.
Recommendations:
- Use Proxy Services: Consider using services like:
  - Oxylabs (already referenced in your code)
  - ScraperAPI
  - Bright Data (formerly Luminati)
- Implement Delays: Add random delays between requests:

import time
import random

for url in urls:
    result = fetch(url)
    time.sleep(random.uniform(2, 5))
- Rotate User Agents: The package already rotates a built-in list, but you can supply your own:

config = FetchConfig(
    user_agents=[
        "Your custom user agent 1",
        "Your custom user agent 2",
    ]
)
- Use Sessions: Reuse a single session for multiple requests to the same domain (see the sketch after this list)
- Selenium/Playwright: For JavaScript-heavy sites (not included in this package)
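fetch-url-package does not document a session API of its own, so the following is a minimal sketch of the idea using the well-known requests library: a single Session reuses connections (and cookies) across requests to the same domain.

import requests

# Illustrative only: one Session keeps a connection pool per host,
# avoiding a new TCP/TLS handshake for every request.
session = requests.Session()
session.headers.update({"User-Agent": "my-crawler/1.0"})  # hypothetical user agent

urls = ["https://example.com/a", "https://example.com/b"]
for url in urls:
    response = session.get(url, timeout=30)
    print(url, response.status_code)

session.close()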
2. Handling Redirects
The package automatically handles:
- HTTP redirects (301, 302, 307, 308)
- Meta refresh redirects
- JavaScript redirects (partial support)
Configuration:
config = FetchConfig(
    follow_redirects=True,
    max_redirects=10
)
For Complex JavaScript Redirects:
Consider using browser automation tools like Selenium or Playwright for pages that heavily rely on JavaScript.
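As a rough sketch (assuming Playwright is installed and its browsers have been set up with "playwright install"), rendering a JavaScript-heavy page and capturing the final HTML might look like this; the resulting HTML can then be run through your own extraction step.

from playwright.sync_api import sync_playwright

# Render the page in a headless browser and read the final state.
with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto("https://example.com", wait_until="networkidle")
    html = page.content()
    print(page.url)  # final URL after any JavaScript redirects
    browser.close()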
3. Domain Caching Strategy
Use Cases:
- Large-scale scraping operations
- Batch URL processing
- Avoiding repeated failures
Example:
from fetch_url_package import DomainCache, FetchConfig, fetch
cache = DomainCache(
    cache_file="/var/cache/fetch_domains.json",
    ttl=86400,
    failure_threshold=3
)
config = FetchConfig(use_cache=True, cache=cache)
urls = ["http://example1.com", "http://example2.com"]
for url in urls:
    result = fetch(url, config=config)
    if result.error_type == "cached_failure":
        print(f"Skipped cached domain: {url}")
Cache Statistics:
stats = cache.get_stats()
print(f"Cached domains: {stats['total_entries']}")
print(f"Domains: {stats['domains']}")
4. Rate Limiting
Implement Your Own Rate Limiting:
import time
from collections import defaultdict
class RateLimiter:
    def __init__(self, requests_per_second=1):
        self.rps = requests_per_second
        self.last_request = defaultdict(float)

    def wait_if_needed(self, domain):
        now = time.time()
        elapsed = now - self.last_request[domain]
        if elapsed < (1.0 / self.rps):
            time.sleep((1.0 / self.rps) - elapsed)
        self.last_request[domain] = time.time()
from urllib.parse import urlparse

limiter = RateLimiter(requests_per_second=2)
for url in urls:
    domain = urlparse(url).netloc
    limiter.wait_if_needed(domain)
    result = fetch(url)
5. Concurrent Fetching
Using ThreadPoolExecutor:
from concurrent.futures import ThreadPoolExecutor, as_completed
from fetch_url_package import fetch, FetchConfig
def fetch_url(url):
    return fetch(url)
urls = ["http://example1.com", "http://example2.com", "http://example3.com"]
with ThreadPoolExecutor(max_workers=5) as executor:
    futures = {executor.submit(fetch_url, url): url for url in urls}
    for future in as_completed(futures):
        url = futures[future]
        try:
            result = future.result()
            if result.success:
                print(f"Success: {url}")
            else:
                print(f"Failed: {url} - {result.error_message}")
        except Exception as e:
            print(f"Exception: {url} - {e}")
6. Custom Proxy Support
Using HTTP Proxy:
import os
os.environ['HTTP_PROXY'] = 'http://proxy:port'
os.environ['HTTPS_PROXY'] = 'https://proxy:port'
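Assuming the HTTP client underneath fetch-url-package honors these standard environment variables (most Python HTTP clients do), set them before calling fetch:

import os
from fetch_url_package import fetch

# Proxy address is a placeholder; this assumes the underlying client
# picks up HTTP_PROXY/HTTPS_PROXY from the environment.
os.environ['HTTP_PROXY'] = 'http://proxy:port'
os.environ['HTTPS_PROXY'] = 'https://proxy:port'

result = fetch("https://example.com")
print(result.success)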
7. Handling Different Content Types
Check Response Content:
result = fetch_html("https://example.com")
if result.success and result.html:
    # Heuristic check that the response body is an HTML document
    if result.html.strip().startswith('<!DOCTYPE') or '<html' in result.html.lower():
        print("Looks like an HTML document")
    else:
        print("Response is not HTML (may be JSON, XML, or plain text)")
Examples
Example 1: Basic Fetching
from fetch_url_package import fetch

result = fetch("https://en.wikipedia.org/wiki/Python_(programming_language)")
if result.success:
    print(f"Extracted {len(result.content)} characters")
    print(result.content[:500])
else:
    print(f"Error: {result.error_message}")
Example 2: Using Content Cache for Performance
from fetch_url_package import fetch, FetchConfig
import time

config = FetchConfig(content_cache_size=500)

urls = [
    "https://example.com",
    "https://example.com",
    "https://example.com",
]

for i, url in enumerate(urls, 1):
    start = time.time()
    result = fetch(url, config=config)
    elapsed = time.time() - start
    if result.success:
        from_cache = result.metadata.get("from_cache", False)
        print(f"Fetch {i}: {elapsed:.3f}s (cached: {from_cache})")

if config.content_cache:
    stats = config.content_cache.get_stats()
    print(f"Cache: {stats['total_entries']} entries")
Example 3: Batch Processing with Domain Cache
from fetch_url_package import fetch, FetchConfig, DomainCache

cache = DomainCache(cache_file="batch_cache.json")
config = FetchConfig(use_cache=True, cache=cache)

urls = [
    "https://example.com/page1",
    "https://example.com/page2",
    "https://example.com/page3",
]

results = []
for url in urls:
    result = fetch(url, config=config)
    results.append(result)

print(cache.get_stats())
Example 4: Trafilatura Extraction with Options
from fetch_url_package import fetch, FetchConfig, ExtractionMethod

config = FetchConfig(
    extraction_method=ExtractionMethod.TRAFILATURA,
    extraction_kwargs={
        "include_tables": True,
        "include_links": True,
        "include_comments": False,
    }
)
result = fetch("https://example.com", config=config)
Migration from Old Code
If you're migrating from the old fetch_url.py:
Old Code:
from fetch_url import fetch_and_extract

content, error = fetch_and_extract(url)
if error:
    print(f"Error: {error}")
else:
    print(content)
New Code:
from fetch_url_package import fetch

result = fetch(url)
if result.success:
    print(result.content)
else:
    print(f"Error: {result.error_message}")
Using Trafilatura (like old default):
from fetch_url_package import fetch, FetchConfig, ExtractionMethod
config = FetchConfig(extraction_method=ExtractionMethod.TRAFILATURA)
result = fetch(url, config=config)
Development
Running Tests
pip install -e .[dev]
pytest tests/
Code Formatting
black fetch_url_package/
flake8 fetch_url_package/
License
MIT License
Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
Support
For issues and questions, please use the GitHub issue tracker.