
A powerful async news content extraction library with modern API for web scraping and article analysis.
🚀 Modern Async API - Built with asyncio for high-performance concurrent scraping
📰 Universal News Support - Works with news websites and content from any language or region
🎯 Smart Content Extraction - Multiple extraction methods (readability, CSS selectors, JSON-LD); see the JSON-LD sketch after this list
🔄 Flexible Persistence - Memory-only or filesystem persistence modes
🛡️ Error Handling - Robust error handling with custom exception types
📊 Session Management - Built-in session management with race condition protection
🧪 Well Tested - Comprehensive unit tests with high coverage
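
To illustrate what the JSON-LD extraction method in the feature list refers to, here is a minimal, generic sketch of pulling schema.org article metadata out of raw HTML with BeautifulSoup. This is a technique sketch, not journ4list's internal implementation; the helper name and the returned fields are illustrative assumptions.

# Illustrative only: generic JSON-LD article extraction, not journ4list internals.
import json
from typing import Optional

from bs4 import BeautifulSoup  # any HTML parser would do


def extract_json_ld_article(html: str) -> Optional[dict]:
    """Return the first schema.org Article/NewsArticle object found in the page."""
    soup = BeautifulSoup(html, "html.parser")
    for script in soup.find_all("script", type="application/ld+json"):
        try:
            data = json.loads(script.string or "")
        except json.JSONDecodeError:
            continue
        # JSON-LD payloads may be a single object or a list of objects.
        candidates = data if isinstance(data, list) else [data]
        for item in candidates:
            if isinstance(item, dict) and item.get("@type") in ("Article", "NewsArticle"):
                return {
                    "title": item.get("headline"),
                    "author": item.get("author"),
                    "published_date": item.get("datePublished"),
                }
    return None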
pip install journ4list
poetry add journ4list
# Clone the repository
git clone https://github.com/username/journalist.git
cd journalist
# Install with Poetry
poetry install
# Activate virtual environment
poetry shell
# Clone the repository
git clone https://github.com/username/journalist.git
cd journalist
# Create virtual environment
python -m venv .venv
source .venv/bin/activate # On Windows: .venv\Scripts\activate
# Install pip-tools
pip install pip-tools
# Compile and install dependencies
pip-compile requirements.in --output-file requirements.txt
pip install -r requirements.txt
import asyncio
from journalist import Journalist

async def main():
    # Create journalist instance
    journalist = Journalist(persist=True, scrape_depth=1)

    # Extract content from news sites
    result = await journalist.read(
        urls=[
            "https://www.bbc.com/news",
            "https://www.reuters.com/"
        ],
        keywords=["technology", "sports", "economy"]
    )

    # Access extracted articles
    for article in result['articles']:
        print(f"Title: {article['title']}")
        print(f"URL: {article['url']}")
        print(f"Content: {article['content'][:200]}...")
        print("-" * 50)

    # Check extraction summary
    summary = result['extraction_summary']
    print(f"Processed {summary['urls_processed']} URLs")
    print(f"Found {summary['articles_extracted']} articles")
    print(f"Extraction took {summary['extraction_time_seconds']} seconds")

# Run the example
asyncio.run(main())
import asyncio
from journalist import Journalist

async def main():
    # Use memory-only mode for temporary scraping
    journalist = Journalist(persist=False)

    result = await journalist.read(
        urls=["https://www.cnn.com/"],
        keywords=["news", "breaking"]
    )

    # Articles are stored in memory only
    print(f"Found {len(result['articles'])} articles")
    print(f"Session ID: {result['session_id']}")

asyncio.run(main())
import asyncio
from journalist import Journalist

async def scrape_multiple_sources():
    """Example of concurrent scraping with multiple journalist instances."""

    # Create tasks for different news sources
    async def scrape_sports():
        journalist = Journalist(persist=True, scrape_depth=2)
        return await journalist.read(
            urls=["https://www.espn.com/", "https://www.skysports.com/"],
            keywords=["football", "basketball"]
        )

    async def scrape_tech():
        journalist = Journalist(persist=True, scrape_depth=1)
        return await journalist.read(
            urls=["https://www.techcrunch.com/", "https://www.wired.com/"],
            keywords=["technology", "software"]
        )

    # Run concurrently
    sports_task = asyncio.create_task(scrape_sports())
    tech_task = asyncio.create_task(scrape_tech())

    sports_result, tech_result = await asyncio.gather(sports_task, tech_task)

    print(f"Sports articles: {len(sports_result['articles'])}")
    print(f"Tech articles: {len(tech_result['articles'])}")

asyncio.run(scrape_multiple_sources())
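
When fanning out over many sources, you may also want to cap how many read() calls run at once. Below is a minimal sketch using a plain asyncio.Semaphore; the source list and concurrency limit are arbitrary, and the only journ4list API assumed is the read() signature shown above.

import asyncio
from journalist import Journalist

# Hypothetical source list; at most three read() calls run at any time.
SOURCES = [
    ["https://www.bbc.com/news"],
    ["https://www.reuters.com/"],
    ["https://www.cnn.com/"],
    ["https://www.theguardian.com/"],
]

async def bounded_scrape(max_concurrent: int = 3):
    semaphore = asyncio.Semaphore(max_concurrent)

    async def scrape(urls):
        # The semaphore bounds concurrency across all scheduled coroutines.
        async with semaphore:
            journalist = Journalist(persist=False)
            return await journalist.read(urls=urls, keywords=["news"])

    results = await asyncio.gather(*(scrape(urls) for urls in SOURCES))
    for result in results:
        print(f"{result['extraction_summary']['urls_processed']} URLs processed")

asyncio.run(bounded_scrape())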
persist (bool, default: True) - Enable filesystem persistence for session data
scrape_depth (int, default: 1) - Depth level for link discovery and scraping

The library uses sensible defaults but can be configured via the JournalistConfig class:
from journalist.config import JournalistConfig
# Get current workspace path
workspace = JournalistConfig.get_base_workspace_path()
print(f"Workspace: {workspace}") # Output: .journalist_workspace
The library provides custom exception types for better error handling:
import asyncio
from journalist import Journalist
from journalist.exceptions import NetworkError, ExtractionError, ValidationError

async def robust_scraping():
    try:
        journalist = Journalist()
        result = await journalist.read(
            urls=["https://example-news-site.com/"],
            keywords=["important", "news"]
        )
        return result
    except NetworkError as e:
        print(f"Network error: {e}")
        if hasattr(e, 'status_code'):
            print(f"HTTP Status: {e.status_code}")
    except ExtractionError as e:
        print(f"Content extraction failed: {e}")
        if hasattr(e, 'url'):
            print(f"Failed URL: {e.url}")
    except ValidationError as e:
        print(f"Input validation error: {e}")
    except Exception as e:
        print(f"Unexpected error: {e}")

asyncio.run(robust_scraping())
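
For transient network failures, a common complement to the handlers above is retrying with exponential backoff. This is a minimal sketch, assuming NetworkError is the exception worth retrying and reusing the read() call from the examples above; the attempt count and delays are arbitrary.

import asyncio
from journalist import Journalist
from journalist.exceptions import NetworkError

async def read_with_retries(urls, keywords=None, attempts=3, base_delay=1.0):
    """Retry journalist.read() on NetworkError with exponential backoff."""
    journalist = Journalist(persist=False)
    for attempt in range(1, attempts + 1):
        try:
            return await journalist.read(urls=urls, keywords=keywords)
        except NetworkError as e:
            if attempt == attempts:
                raise  # Give up after the final attempt.
            delay = base_delay * 2 ** (attempt - 1)
            print(f"Attempt {attempt} failed ({e}); retrying in {delay:.1f}s")
            await asyncio.sleep(delay)

asyncio.run(read_with_retries(["https://www.bbc.com/news"], keywords=["economy"]))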
__init__(persist=True, scrape_depth=1)
Initialize a new Journalist instance.

Parameters:
persist (bool): Enable filesystem persistence
scrape_depth (int): Link discovery depth level

async read(urls, keywords=None)
Extract content from provided URLs with optional keyword filtering.

Parameters:
urls (List[str]): List of website URLs to process
keywords (Optional[List[str]]): Keywords for relevance filtering

Returns:
Dict[str, Any]: Dictionary containing extracted articles and metadata

Return Structure:
{
    'articles': [
        {
            'title': str,
            'url': str,
            'content': str,
            'author': str,
            'published_date': str,
            'keywords_found': List[str]
        }
    ],
    'session_id': str,
    'extraction_summary': {
        'session_id': str,
        'urls_requested': int,
        'urls_processed': int,
        'articles_extracted': int,
        'extraction_time_seconds': float,
        'keywords_used': List[str]
    }
}
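
Because the returned structure is plain dicts and lists, it can be written straight to disk. A small sketch that dumps the extracted articles to a JSON file; the URL, keywords, and filename are placeholders.

import asyncio
import json
from journalist import Journalist

async def save_articles(path="articles.json"):
    journalist = Journalist(persist=False)
    result = await journalist.read(
        urls=["https://www.reuters.com/"],
        keywords=["markets"],
    )
    # The return value is JSON-serializable, so json.dump works directly.
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(result["articles"], fh, ensure_ascii=False, indent=2)
    print(f"Wrote {len(result['articles'])} articles to {path}")

asyncio.run(save_articles())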
# Using Poetry
poetry run pytest
# Using pip
pytest
# With coverage
pytest --cov=journalist --cov-report=html
# Format code
black src tests
# Sort imports
isort src tests
# Type checking
mypy src
# Linting
pylint src
The project supports both Poetry and pip-tools for dependency management:
Poetry (pyproject.toml):
poetry install --with dev
pip-tools (requirements.in):
pip-compile requirements.in --output-file requirements.txt
python -m pip install -r requirements.txt
git checkout -b feature/amazing-feature
pytest
black src tests
git commit -m 'Add amazing feature'
git push origin feature/amazing-feature

This project is licensed under the MIT License - see the LICENSE file for details.
Oktay Burak Ertas
Email: oktay.burak.ertas@gmail.com