
fetch-url-package
Professional web content fetching and extraction toolkit with configurable extraction methods, detailed error handling, and domain caching.
# Basic installation
pip install fetch-url-package

# With the optional trafilatura extractor
pip install fetch-url-package[trafilatura]

# With development dependencies
pip install fetch-url-package[dev]
from fetch_url_package import fetch
# Fetch and extract content with default settings
result = fetch("https://example.com")
if result.success:
    print("Content:", result.content)
else:
    print(f"Error ({result.error_type}): {result.error_message}")
from fetch_url_package import fetch, FetchConfig, ExtractionMethod
config = FetchConfig(
    extraction_method=ExtractionMethod.TRAFILATURA,
    extraction_kwargs={"include_tables": True}
)

result = fetch("https://example.com", config=config)
if result.success:
    print(result.content)
from fetch_url_package import fetch_html
result = fetch_html("https://example.com")
if result.success:
    print("HTML:", result.html)
The content cache stores successfully fetched webpage content (HTML and extracted text) to avoid redundant requests. It is optimized for high concurrency through sharding.
from fetch_url_package import fetch, FetchConfig, ContentCache
# Simple: Enable content cache with size limit of 500 pages
config = FetchConfig(content_cache_size=500)
# First fetch - will hit the network
result1 = fetch("https://example.com", config=config)
# Second fetch - will use cached content (much faster!)
result2 = fetch("https://example.com", config=config)
print(result2.metadata.get("from_cache")) # True
# For ultra-high concurrency: customize sharding
# More shards = better concurrency but slightly more memory
high_concurrency_cache = ContentCache(max_size=1000, num_shards=32)
config_high_perf = FetchConfig(content_cache=high_concurrency_cache)
# Check cache statistics
if config.content_cache:
    stats = config.content_cache.get_stats()
    print(f"Cached pages: {stats['total_entries']}/{stats['max_size']}")
    print(f"Shards: {stats['num_shards']}")
from fetch_url_package import fetch, FetchConfig, ExtractionMethod, DomainCache, ContentCache
# Create a custom domain cache (for failures)
domain_cache = DomainCache(
    cache_file="/tmp/fetch_cache.json",
    ttl=86400,  # 24 hours
    failure_threshold=3
)
# Create a custom content cache (for successful fetches)
content_cache = ContentCache(max_size=1000)
# Configure fetch settings
config = FetchConfig(
    # Retry settings
    max_retries=5,
    retry_delay=2.0,
    # Timeout settings
    timeout=60.0,
    connect_timeout=15.0,
    # Extraction settings
    extraction_method=ExtractionMethod.SIMPLE,
    # Custom headers
    custom_headers={
        "X-Custom-Header": "value"
    },
    # Domain cache settings (for failures)
    use_cache=True,
    cache=domain_cache,
    # Content cache settings (for successful fetches)
    content_cache=content_cache,
    # Return HTML along with extracted content
    return_html=True,
    # Blocked domains
    blocked_domains=["example-blocked.com"]
)
result = fetch("https://example.com", config=config)
fetch(url, config=None, extract=True)
fetch_async(url, config=None, extract=True)
Fetch and optionally extract content from a URL.
Parameters:
url (str): URL to fetch
config (FetchConfig, optional): Configuration object
extract (bool): Whether to extract content (default: True)
Returns: FetchResult object
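fetch_async is not demonstrated elsewhere on this page; the following is a minimal sketch, assuming it mirrors fetch's signature and returns the same FetchResult when awaited:

import asyncio
from fetch_url_package import fetch_async, FetchConfig

async def main():
    # Assumption: fetch_async accepts the same arguments as fetch
    config = FetchConfig(timeout=30.0)
    result = await fetch_async("https://example.com", config=config)
    if result.success:
        print(result.content[:200])
    else:
        print(f"Error ({result.error_type}): {result.error_message}")

asyncio.run(main())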
fetch_html(url, config=None)
Fetch HTML content only, without extraction.
Parameters:
url (str): URL to fetch
config (FetchConfig, optional): Configuration object
Returns: FetchResult object
FetchConfig
Configuration for fetch operations.
Parameters:
max_retries (int): Maximum retry attempts (default: 3)
retry_delay (float): Base delay between retries in seconds (default: 1.0)
timeout (float): Request timeout in seconds (default: 30.0)
connect_timeout (float): Connection timeout in seconds (default: 10.0)
follow_redirects (bool): Follow HTTP redirects (default: True)
max_redirects (int): Maximum number of redirects (default: 10)
http2 (bool): Use HTTP/2 (default: True)
verify_ssl (bool): Verify SSL certificates (default: False)
user_agents (List[str], optional): List of user agents to rotate
referers (List[str], optional): List of referers to rotate
custom_headers (Dict[str, str], optional): Custom HTTP headers
extraction_method (ExtractionMethod): Extraction method (default: SIMPLE)
extraction_kwargs (Dict): Additional arguments for the extractor
filter_file_extensions (bool): Filter file URLs (default: True)
blocked_domains (List[str], optional): Domains to block
use_cache (bool): Use domain cache for failed domains (default: False)
cache (DomainCache, optional): Domain cache instance for failed domains
content_cache_size (int): Size of content cache for successful fetches (default: 0, disabled)
content_cache (ContentCache, optional): Content cache instance
return_html (bool): Include HTML in result (default: False)
DomainCache
Cache for tracking failed domains to avoid repeated failures.
Parameters:
cache_file (str, optional): Path to cache file for persistence
ttl (int): Time-to-live for cache entries in seconds (default: 86400)
failure_threshold (int): Failures before caching a domain (default: 3)
max_size (int): Maximum cache entries (default: 10000)
Methods:
should_skip(url): Check if a URL should be skipped
record_failure(url, error_type): Record a failure
record_success(url): Record a success
clear(): Clear all cache entries
get_stats(): Get cache statistics
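A small sketch of calling these DomainCache methods directly, assuming should_skip returns a boolean and record_failure accepts the error type reported on the FetchResult (the URL is hypothetical):

from fetch_url_package import DomainCache, fetch

cache = DomainCache(failure_threshold=3)
url = "https://flaky.example.com/page"  # hypothetical URL

# Assumption: should_skip() becomes True once the domain crosses failure_threshold
if cache.should_skip(url):
    print("Skipping domain with repeated failures")
else:
    result = fetch(url)
    if result.success:
        cache.record_success(url)
    else:
        cache.record_failure(url, result.error_type)

print(cache.get_stats())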
ContentCache
High-performance LRU cache for storing successfully fetched webpage content (both HTML and extracted text).
Parameters:
max_size (int): Maximum number of entries to cache (default: 500)
num_shards (int): Number of cache shards for concurrency (default: 16)
Features:
Concurrency Optimization: The cache uses sharding to minimize lock contention under high concurrency. URLs are distributed across multiple shards based on hash, allowing concurrent operations on different URLs to proceed in parallel without blocking each other. This design supports thousands of concurrent requests efficiently.
Methods:
get(url): Get cached content for a URL
put(url, html, content, final_url, metadata): Store content in the cache
clear(): Clear all cache entries
get_stats(): Get cache statistics
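A brief sketch of driving a ContentCache by hand rather than through FetchConfig. The put() arguments follow the signature listed above, but the keyword usage and the return shape of get() are assumptions:

from fetch_url_package import ContentCache

cache = ContentCache(max_size=100, num_shards=8)

# Store a fetched page (arguments follow the put() signature above)
cache.put(
    "https://example.com",
    html="<html><body>Hello</body></html>",
    content="Hello",
    final_url="https://example.com/",
    metadata={"status_code": 200},
)

# Assumption: get() returns the cached entry, or None on a miss
entry = cache.get("https://example.com")
if entry is not None:
    print("Cache hit")

print(cache.get_stats())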
FetchResult
Result object containing fetch outcome and data.
Attributes:
url (str): Original URL
success (bool): Whether the fetch was successful
content (str, optional): Extracted content
html (str, optional): Raw HTML content
error_type (ErrorType, optional): Type of error if failed
error_message (str, optional): Error message if failed
status_code (int, optional): HTTP status code
final_url (str, optional): Final URL after redirects
metadata (Dict): Additional metadata
ExtractionMethod.SIMPLE (default)
Simple and fast extraction that removes HTML/XML tags without complex parsing.
Pros:
Cons:
ExtractionMethod.TRAFILATURA
Advanced extraction using the trafilatura library.
Pros:
Cons:
The package provides detailed error types:
NOT_FOUND (404): Page not found
FORBIDDEN (403): Access denied
RATE_LIMITED (429): Too many requests
SERVER_ERROR (5xx): Server error
TIMEOUT: Request timeout
NETWORK_ERROR: Network/connection error
SSL_ERROR: SSL/TLS error
FILTERED: URL filtered by configuration
EMPTY_CONTENT: Page returned empty content
EXTRACTION_FAILED: Content extraction failed
CACHED_FAILURE: Domain in failure cache
UNKNOWN: Unknown error
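A hedged sketch of branching on these error types; how ErrorType renders as a string is an assumption, so compare against the enum itself if the package exports it:

from fetch_url_package import fetch

result = fetch("https://example.com/maybe-missing")  # hypothetical URL
if not result.success:
    # Assumption: the ErrorType value stringifies to names like "RATE_LIMITED"
    error = str(result.error_type)
    if "RATE_LIMITED" in error:
        print("Back off before retrying this domain")
    elif "TIMEOUT" in error or "NETWORK_ERROR" in error:
        print("Transient failure - retry with a longer timeout")
    else:
        print(f"Giving up: {result.error_message}")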
Challenge: Many websites use CAPTCHA or human verification to block automated requests.
Recommendations:
Use Proxy Services: Consider using services like:
Implement Delays: Add random delays between requests
import time
import random
from fetch_url_package import fetch

for url in urls:
    result = fetch(url)
    time.sleep(random.uniform(2, 5))  # 2-5 second delay between requests
Rotate User Agents: Already built-in, but you can add more
config = FetchConfig(
    user_agents=[
        "Your custom user agent 1",
        "Your custom user agent 2",
    ]
)
Use Sessions: For multiple requests to the same domain
# Future enhancement - session management
Selenium/Playwright: For JavaScript-heavy sites (not included in this package)
The package automatically handles:
Configuration:
config = FetchConfig(
    follow_redirects=True,
    max_redirects=10  # Adjust as needed
)
For Complex JavaScript Redirects: Consider using browser automation tools like Selenium or Playwright for pages that heavily rely on JavaScript.
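For pages that only produce useful content after JavaScript runs, a minimal Playwright sketch (a separate dependency, not part of fetch-url-package) could look like this:

# Requires: pip install playwright && playwright install chromium
from playwright.sync_api import sync_playwright

def fetch_with_browser(url: str) -> str:
    # Render the page in a headless browser so JavaScript redirects complete
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto(url, wait_until="networkidle")
        html = page.content()
        browser.close()
    return html

html = fetch_with_browser("https://example.com")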
Use Cases:
Example:
from fetch_url_package import DomainCache, FetchConfig, fetch
# Persistent cache
cache = DomainCache(
    cache_file="/var/cache/fetch_domains.json",
    ttl=86400,           # 24 hours
    failure_threshold=3  # Cache after 3 failures
)

config = FetchConfig(use_cache=True, cache=cache)

# Fetch multiple URLs
urls = ["http://example1.com", "http://example2.com"]
for url in urls:
    result = fetch(url, config=config)
    if result.error_type == "cached_failure":
        print(f"Skipped cached domain: {url}")
Cache Statistics:
stats = cache.get_stats()
print(f"Cached domains: {stats['total_entries']}")
print(f"Domains: {stats['domains']}")
Implement Your Own Rate Limiting:
import time
from collections import defaultdict
from urllib.parse import urlparse

from fetch_url_package import fetch

class RateLimiter:
    """Per-domain rate limiter that sleeps to keep requests under a target rate."""

    def __init__(self, requests_per_second=1):
        self.rps = requests_per_second
        self.last_request = defaultdict(float)

    def wait_if_needed(self, domain):
        now = time.time()
        elapsed = now - self.last_request[domain]
        if elapsed < (1.0 / self.rps):
            time.sleep((1.0 / self.rps) - elapsed)
        self.last_request[domain] = time.time()

# Usage
limiter = RateLimiter(requests_per_second=2)
for url in urls:
    domain = urlparse(url).netloc
    limiter.wait_if_needed(domain)
    result = fetch(url)
Using ThreadPoolExecutor:
from concurrent.futures import ThreadPoolExecutor, as_completed
from fetch_url_package import fetch

def fetch_url(url):
    return fetch(url)

urls = ["http://example1.com", "http://example2.com", "http://example3.com"]

with ThreadPoolExecutor(max_workers=5) as executor:
    futures = {executor.submit(fetch_url, url): url for url in urls}
    for future in as_completed(futures):
        url = futures[future]
        try:
            result = future.result()
            if result.success:
                print(f"Success: {url}")
            else:
                print(f"Failed: {url} - {result.error_message}")
        except Exception as e:
            print(f"Exception: {url} - {e}")
Using HTTP Proxy:
# Note: Current version doesn't have built-in proxy support
# Future enhancement or workaround using environment variables:
import os
os.environ['HTTP_PROXY'] = 'http://proxy:port'
os.environ['HTTPS_PROXY'] = 'https://proxy:port'
# Or modify the fetch.py to add proxy support in httpx.AsyncClient
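As a stopgap, plain httpx supports proxies directly. The sketch below is illustrative only, since how fetch.py builds its AsyncClient is internal to the package; the proxy address is the same placeholder used above:

import asyncio
import httpx

async def fetch_via_proxy(url: str) -> str:
    # httpx >= 0.26 uses proxy=; older releases use proxies=
    async with httpx.AsyncClient(proxy="http://proxy:port") as client:
        response = await client.get(url)
        return response.text

html = asyncio.run(fetch_via_proxy("https://example.com"))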
Check Response Content:
result = fetch_html("https://example.com")
if result.success and result.html:
    # Check if it's actually HTML
    if result.html.strip().startswith('<!DOCTYPE') or '<html' in result.html.lower():
        # Process HTML
        pass
from fetch_url_package import fetch
result = fetch("https://en.wikipedia.org/wiki/Python_(programming_language)")
if result.success:
    print(f"Extracted {len(result.content)} characters")
    print(result.content[:500])  # First 500 characters
else:
    print(f"Error: {result.error_message}")
from fetch_url_package import fetch, FetchConfig
import time
# Enable content cache
config = FetchConfig(content_cache_size=500)
urls = [
    "https://example.com",
    "https://example.com",  # Duplicate - will use cache
    "https://example.com",  # Duplicate - will use cache
]

for i, url in enumerate(urls, 1):
    start = time.time()
    result = fetch(url, config=config)
    elapsed = time.time() - start

    if result.success:
        from_cache = result.metadata.get("from_cache", False)
        print(f"Fetch {i}: {elapsed:.3f}s (cached: {from_cache})")

# Check cache stats
if config.content_cache:
    stats = config.content_cache.get_stats()
    print(f"Cache: {stats['total_entries']} entries")
from fetch_url_package import fetch, FetchConfig, DomainCache
cache = DomainCache(cache_file="batch_cache.json")
config = FetchConfig(use_cache=True, cache=cache)
urls = [
    "https://example.com/page1",
    "https://example.com/page2",
    "https://example.com/page3",
]

results = []
for url in urls:
    result = fetch(url, config=config)
    results.append(result)
# Check cache stats
print(cache.get_stats())
from fetch_url_package import fetch, FetchConfig, ExtractionMethod
# Use trafilatura with custom options
config = FetchConfig(
    extraction_method=ExtractionMethod.TRAFILATURA,
    extraction_kwargs={
        "include_tables": True,
        "include_links": True,
        "include_comments": False,
    }
)
result = fetch("https://example.com", config=config)
If you're migrating from the old fetch_url.py:
Old:
from fetch_url import fetch_and_extract

content, error = fetch_and_extract(url)
if error:
    print(f"Error: {error}")
else:
    print(content)

New:
from fetch_url_package import fetch

result = fetch(url)
if result.success:
    print(result.content)
else:
    print(f"Error: {result.error_message}")
from fetch_url_package import fetch, FetchConfig, ExtractionMethod
config = FetchConfig(extraction_method=ExtractionMethod.TRAFILATURA)
result = fetch(url, config=config)
# Install in editable mode with development dependencies
pip install -e .[dev]

# Run the test suite
pytest tests/

# Format and lint
black fetch_url_package/
flake8 fetch_url_package/
MIT License
Contributions are welcome! Please feel free to submit a Pull Request.
For issues and questions, please use the GitHub issue tracker.
FAQs
We found that fetch-url-package demonstrated a healthy version release cadence and project activity, as the last version was released less than a year ago. The project has one open source maintainer.