🌐 Web Maestro

Python 3.9+ · MIT License · Available on PyPI

Production-ready web content extraction with multi-provider LLM support and intelligent browser automation.

Web Maestro is a Python library that combines advanced web scraping capabilities with AI-powered content analysis. It provides browser automation using Playwright and integrates with multiple LLM providers for intelligent content extraction and analysis.

🔥 Real-World Example: Smart Baseball Data Pipeline

Imagine you need to build a comprehensive baseball analytics system that monitors multiple sports websites and extracts game statistics, player performance data, and news updates in real time. Web Maestro makes this remarkably simple:

import asyncio
from web_maestro import WebMaestro, LLMConfig

async def smart_baseball_crawler():
    # Configure your AI-powered crawler
    config = LLMConfig(
        provider="openai",  # or anthropic, portkey, ollama
        api_key="your-api-key",
        model="gpt-4o"
    )

    maestro = WebMaestro(config)

    # Define what you want to extract
    extraction_prompt = """
    Extract baseball data and structure it as JSON:
    - Game scores and schedules
    - Player statistics (batting avg, ERA, etc.)
    - Injury reports and roster changes
    - Latest news headlines

    Focus on actionable data for fantasy baseball decisions.
    """

    # Crawl multiple sources intelligently
    sources = [
        "https://www.espn.com/mlb/",
        "https://www.mlb.com/",
        "https://www.baseball-reference.com/"
    ]

    for url in sources:
        # AI automatically understands site structure and extracts relevant data
        result = await maestro.extract_structured_data(
            url=url,
            prompt=extraction_prompt,
            output_format="json"
        )

        if result.success:
            print(f"📊 Extracted from {url}:")
            print(f"⚾ Games: {len(result.data.get('games', []))}")
            print(f"👤 Players: {len(result.data.get('players', []))}")
            print(f"📰 News: {len(result.data.get('news', []))}")

            # Data is automatically structured and ready for your database
            # (save_to_database is your own persistence function, not shown here)
            await save_to_database(result.data)

# Run your intelligent baseball pipeline
asyncio.run(smart_baseball_crawler())

Why This Example Matters:

  • 🧠 AI-Powered: No manual CSS selectors or HTML parsing - AI understands content contextually
  • 🚀 Production Ready: Handles dynamic content and JavaScript-heavy sites, with rate-limiting utilities built in
  • 🔄 Adaptive: Works across different sports sites without code changes
  • 📊 Structured Output: Returns clean, structured data ready for analysis or storage

🌟 Key Features

🚀 Advanced Web Extraction

  • Browser Automation: Powered by Playwright for handling dynamic content and JavaScript-heavy sites
  • DOM Capture: Intelligent element interaction including clicks, hovers, and content discovery
  • Session Management: Proper context management for complex extraction workflows

🤖 Multi-Provider LLM Support

  • Universal Interface: Works with OpenAI, Anthropic Claude, Portkey, and Ollama
  • Streaming Support: Real-time content delivery for better user experience
  • Intelligent Analysis: AI-powered content extraction and structuring

🔧 Developer Experience

  • Clean API: Intuitive, well-documented interface
  • Type Safety: Full type hints and Pydantic models
  • Async Support: Built for modern async/await patterns
  • Extensible: Modular architecture for custom providers

📦 Installation

Basic Installation

pip install web-maestro

Quick Verification

After installation, verify everything works:

# Test basic import
from web_maestro import LLMConfig, SessionContext
print("✅ Web Maestro installed successfully!")

# Check available providers
from web_maestro.providers.factory import ProviderRegistry
print(f"📦 Available providers: {ProviderRegistry.list_providers()}")

With Specific LLM Provider

Choose your preferred AI provider:

# For OpenAI GPT models
pip install "web-maestro[openai]"

# For Anthropic Claude models
pip install "web-maestro[anthropic]"

# For Portkey AI gateway
pip install "web-maestro[portkey]"

# For local Ollama models
pip install "web-maestro[ollama]"

# Install all providers
pip install "web-maestro[all-providers]"

System Dependencies

Web Maestro requires Poppler for PDF processing functionality:

macOS (Homebrew):

brew install poppler

Ubuntu/Debian:

sudo apt-get install poppler-utils

Windows: Download from: https://blog.alivate.com.au/poppler-windows/

Development Installation

Quick Setup (Recommended):

git clone https://github.com/fede-dash/web-maestro.git
cd web-maestro

# Automated setup - installs system deps, Python deps, and browsers
hatch run setup-dev

Manual Setup:

git clone https://github.com/fede-dash/web-maestro.git
cd web-maestro

# Install system dependencies
brew install poppler  # macOS
# sudo apt-get install poppler-utils  # Linux

# Install Python dependencies
pip install -e ".[dev,all-features]"

# Install browsers for Playwright
playwright install

# Setup pre-commit hooks
pre-commit install

Available Hatch Scripts:

# Full system and dev setup
hatch run setup-dev

# Install just system dependencies
hatch run setup-system

# Full setup for production use
hatch run setup-full

# Run tests
hatch run test

# Run tests with coverage
hatch run test-cov

# Format and lint code
hatch run format
hatch run lint

🚀 Quick Start

Basic Web Content Extraction

import asyncio
from web_maestro import fetch_rendered_html, SessionContext
from web_maestro.providers.portkey import PortkeyProvider
from web_maestro import LLMConfig

async def extract_content():
    # Configure your LLM provider
    config = LLMConfig(
        provider="portkey",
        api_key="your-api-key",
        model="gpt-4",
        base_url="your-portkey-endpoint",
        extra_params={"virtual_key": "your-virtual-key"}
    )

    provider = PortkeyProvider(config)

    # Extract content using browser automation
    ctx = SessionContext()
    blocks = await fetch_rendered_html(
        url="https://example.com",
        ctx=ctx
    )

    if blocks:
        # Combine extracted content
        content = "\n".join([block.content for block in blocks[:20]])

        # Analyze with AI
        response = await provider.complete(
            f"Analyze this content and extract key information:\n{content[:5000]}"
        )

        if response.success:
            print("Extracted content:", response.content)
        else:
            print("Error:", response.error)

asyncio.run(extract_content())
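The example above truncates content to 5,000 characters to stay within the model's context limit. For longer pages, splitting the content into bounded, slightly overlapping chunks is more robust. Here is a minimal sketch; the `chunk_text` helper is illustrative and not part of web-maestro:

```python
def chunk_text(text: str, max_chars: int = 5000, overlap: int = 200) -> list[str]:
    """Split text into chunks of at most max_chars characters.

    Consecutive chunks overlap by `overlap` characters so a sentence cut
    at a boundary still appears whole in at least one chunk.
    """
    if max_chars <= overlap:
        raise ValueError("max_chars must be larger than overlap")
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + max_chars, len(text))
        chunks.append(text[start:end])
        if end == len(text):
            break
        start = end - overlap
    return chunks
```

Each chunk can then be passed to `provider.complete()` separately and the per-chunk results merged afterwards.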

Streaming Content Analysis

import asyncio
from web_maestro.providers.portkey import PortkeyProvider
from web_maestro import LLMConfig

async def stream_analysis():
    config = LLMConfig(
        provider="portkey",
        api_key="your-api-key",
        model="gpt-4",
        base_url="your-endpoint",
        extra_params={"virtual_key": "your-virtual-key"}
    )

    provider = PortkeyProvider(config)

    # Stream response chunks in real-time
    prompt = "Write a detailed analysis of modern web scraping techniques."

    async for chunk in provider.complete_stream(prompt):
        print(chunk, end="", flush=True)

asyncio.run(stream_analysis())

Using Enhanced Fetcher

import asyncio
from web_maestro.utils import EnhancedFetcher

async def fetch_with_caching():
    # Create fetcher with intelligent caching
    fetcher = EnhancedFetcher(cache_ttl=300)  # 5-minute cache

    # Attempt static fetch first, fallback to browser if needed
    blocks = await fetcher.try_static_first("https://example.com")

    print(f"Fetched {len(blocks)} content blocks")
    for block in blocks[:5]:
        print(f"[{block.content_type}] {block.content[:100]}...")

asyncio.run(fetch_with_caching())
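`try_static_first` reflects a common design choice: fetch the page with a plain HTTP client first, and fall back to the (much slower) browser only when the HTML looks like an empty JavaScript shell. A hedged sketch of one such heuristic follows; web-maestro's actual fallback criteria may differ:

```python
import re

def looks_like_js_shell(html: str, min_text_chars: int = 200) -> bool:
    """Heuristic: if the markup has almost no visible text but contains a
    typical SPA mount node, it probably needs browser rendering."""
    # Crudely strip scripts and tags to estimate visible text.
    text = re.sub(r"<script.*?</script>", "", html, flags=re.S | re.I)
    text = re.sub(r"<[^>]+>", " ", text)
    visible = "".join(text.split())
    spa_markers = ('id="root"', 'id="app"', "window.__INITIAL_STATE__")
    has_marker = any(m in html for m in spa_markers)
    return len(visible) < min_text_chars and has_marker
```

When the heuristic returns True, a fetcher would hand the URL off to Playwright instead of parsing the static response.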

🎯 Current Capabilities

What's Working

  • Browser Automation: Full Playwright integration for dynamic content
  • Multi-Provider LLM: OpenAI, Anthropic, Portkey, and Ollama support
  • Streaming: Real-time response streaming from LLM providers
  • Content Extraction: DOM capture with multiple content types
  • Session Management: Proper browser context and session handling
  • Type Safety: Comprehensive type hints throughout the codebase

🚧 In Development

  • WebMaestro Class: High-level orchestration (basic implementation exists)
  • Advanced DOM Interaction: Tab expansion, hover detection (framework exists)
  • Rate Limiting: Smart request throttling (utility classes available)
  • Caching Layer: Response caching with TTL (basic implementation exists)
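The caching layer mentioned above is, at heart, a response cache keyed by URL with a time-to-live. A minimal sketch of the idea (illustrative only; the library's implementation may differ):

```python
import time

class TTLCache:
    """Tiny TTL cache: entries expire `ttl` seconds after insertion."""

    def __init__(self, ttl: float = 300.0):
        self.ttl = ttl
        self._store: dict[str, tuple[float, object]] = {}

    def get(self, key: str):
        entry = self._store.get(key)
        if entry is None:
            return None
        stored_at, value = entry
        if time.monotonic() - stored_at > self.ttl:
            del self._store[key]  # expired: evict lazily on read
            return None
        return value

    def set(self, key: str, value) -> None:
        self._store[key] = (time.monotonic(), value)
```

A fetcher would consult the cache before hitting the network, so repeated extractions of the same URL within the TTL window are free.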

📋 Planned Features

  • Comprehensive test suite
  • Advanced error recovery
  • Performance monitoring
  • Plugin architecture
  • Documentation website

🔮 Future Roadmap: WebActions Framework

🚧 Coming Soon: Web Maestro is evolving beyond content extraction into intelligent web automation with WebActions.

WebActions is an intelligent automation framework that will extend Web Maestro beyond content extraction to sophisticated web interaction capabilities:

🎯 Planned WebActions Features:

  • 🤖 Intelligent Form Automation: AI-driven form completion with context understanding and validation
  • 🔄 Complex Workflow Execution: Multi-step web processes with decision trees and conditional logic
  • 📱 Interactive Element Management: Smart handling of dropdowns, modals, and dynamic UI components
  • 🔐 Authentication Workflows: Automated login sequences with credential management and session persistence
  • 📊 Data Submission Pipelines: Intelligent data entry with validation and error handling
  • 🎮 Game-like Interactions: Advanced interaction patterns for complex web applications
  • 🧠 Action Learning: Machine learning-based action optimization and pattern recognition

🌟 WebActions Vision:

WebActions will transform Web Maestro from a content extraction tool into a comprehensive web automation agent capable of performing complex interactions while maintaining the same level of intelligence and adaptability demonstrated in current content analysis features. This evolution will enable use cases such as:

  • Automated Data Entry: Intelligent form completion across multiple systems
  • Complex Multi-Step Workflows: End-to-end process automation with decision making
  • Intelligent Web Application Testing: AI-driven testing with adaptive scenarios
  • Dynamic Content Management: Automated content publishing and management workflows

Beta Status: The current version focuses on content extraction and analysis. WebActions capabilities are in active development and will be released in future versions.

🔧 Configuration

LLM Provider Setup

from web_maestro import LLMConfig

# OpenAI Configuration
openai_config = LLMConfig(
    provider="openai",
    api_key="sk-...",
    model="gpt-4",
    temperature=0.7,
    max_tokens=2000
)

# Portkey Configuration (with gateway)
portkey_config = LLMConfig(
    provider="portkey",
    api_key="your-portkey-key",
    model="gpt-4",
    base_url="https://your-gateway.com/v1",
    extra_params={
        "virtual_key": "your-virtual-key"
    }
)

# Anthropic Configuration
anthropic_config = LLMConfig(
    provider="anthropic",
    api_key="sk-ant-...",
    model="claude-3-sonnet",
    temperature=0.5
)

Browser Configuration

browser_config = {
    "headless": True,
    "timeout_ms": 30000,
    "viewport": {"width": 1920, "height": 1080},
    "max_scrolls": 15,
    "max_elements_to_click": 25,
    "stability_timeout_ms": 2000
}

blocks = await fetch_rendered_html(
    url="https://complex-spa.com",
    ctx=ctx,
    config=browser_config
)

📚 API Overview

Core Functions

# Browser-based content extraction
from web_maestro import fetch_rendered_html, SessionContext

ctx = SessionContext()
blocks = await fetch_rendered_html(url, ctx, config)

Provider Classes

# All providers implement the same interface
from web_maestro.providers.portkey import PortkeyProvider

provider = PortkeyProvider(config)
response = await provider.complete(prompt)
stream = provider.complete_stream(prompt)

Utility Classes

# Enhanced fetching with caching
from web_maestro.utils import EnhancedFetcher, RateLimiter

fetcher = EnhancedFetcher(cache_ttl=300)
rate_limiter = RateLimiter(max_requests=10, time_window=60)

Data Models

# Structured data types
from web_maestro.models.types import CapturedBlock, CaptureType
from web_maestro.providers.base import LLMResponse, ModelCapability
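For illustration, a captured block is essentially a piece of content tagged with its capture type. Using a stand-in dataclass (hypothetical, not the library's actual `CapturedBlock` definition), downstream code might group blocks like this:

```python
from dataclasses import dataclass
from collections import defaultdict

@dataclass
class Block:  # stand-in for web_maestro.models.types.CapturedBlock
    content: str
    content_type: str

def group_by_type(blocks: list[Block]) -> dict[str, list[str]]:
    """Bucket block contents by their capture type for easier processing."""
    grouped: dict[str, list[str]] = defaultdict(list)
    for b in blocks:
        grouped[b.content_type].append(b.content)
    return dict(grouped)
```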

🛡️ Error Handling

try:
    response = await provider.complete("Your prompt")

    if response.success:
        print("Response:", response.content)
        print(f"Tokens used: {response.total_tokens}")
    else:
        print("Error:", response.error)

except Exception as e:
    print(f"Unexpected error: {e}")
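Transient provider errors (rate limits, timeouts) are usually worth retrying with exponential backoff. A self-contained sketch follows; the `with_retries` helper is illustrative and not shipped by web-maestro:

```python
import asyncio

async def with_retries(fn, attempts: int = 3, base_delay: float = 0.5):
    """Call the async callable `fn` up to `attempts` times,
    doubling the delay between tries."""
    last_exc = None
    for attempt in range(attempts):
        try:
            return await fn()
        except Exception as exc:  # in practice, catch provider-specific errors
            last_exc = exc
            if attempt < attempts - 1:
                await asyncio.sleep(base_delay * (2 ** attempt))
    raise last_exc
```

Usage: `await with_retries(lambda: provider.complete("Your prompt"))`.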

🔄 Streaming Support

# Stream responses for real-time delivery
async for chunk in provider.complete_stream("Your prompt"):
    print(chunk, end="", flush=True)

# Chat streaming
messages = [{"role": "user", "content": "Hello"}]
async for chunk in provider.complete_chat_stream(messages):
    print(chunk, end="", flush=True)

🧪 Testing Your Setup

# Test provider connectivity
import asyncio
from web_maestro.providers.openai import OpenAIProvider
from web_maestro import LLMConfig

async def test_setup():
    config = LLMConfig(
        provider="openai",
        api_key="your-openai-api-key",
        model="gpt-3.5-turbo"
    )

    provider = OpenAIProvider(config)
    response = await provider.complete("Hello, world!")

    if response.success:
        print("✅ Provider working:", response.content)
    else:
        print("❌ Provider failed:", response.error)

# Run the test
asyncio.run(test_setup())

🛠️ Troubleshooting

Common Issues and Solutions

Import Error: "No module named 'web_maestro'"

# Make sure you installed the package
pip install web-maestro

# Note: web-maestro is not yet available on conda-forge; use pip

Browser Dependencies Missing

# Install Playwright browsers
playwright install

# On Linux, you might need additional dependencies
sudo apt-get install libnss3 libxss1 libasound2

PDF Processing Issues

# Install Poppler (required for PDF processing)
# macOS
brew install poppler

# Ubuntu/Debian
sudo apt-get install poppler-utils

# Windows: Download from https://blog.alivate.com.au/poppler-windows/

LLM Provider Authentication

# Verify your API keys are set correctly
import os
print("OpenAI API Key:", os.getenv("OPENAI_API_KEY", "Not set"))
print("Anthropic API Key:", os.getenv("ANTHROPIC_API_KEY", "Not set"))

Rate Limiting Issues

# Use built-in rate limiting
from web_maestro.utils import RateLimiter

rate_limiter = RateLimiter(max_requests=10, time_window=60)
await rate_limiter.acquire()  # Will wait if needed
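Under the hood, a limiter like this typically keeps timestamps of recent requests and sleeps until a slot frees up. A minimal sliding-window sketch of that idea (illustrative; not the library's actual code):

```python
import asyncio
import time
from collections import deque

class SlidingWindowLimiter:
    """Allow at most max_requests calls per time_window seconds."""

    def __init__(self, max_requests: int, time_window: float):
        self.max_requests = max_requests
        self.time_window = time_window
        self._stamps: deque = deque()

    async def acquire(self) -> None:
        now = time.monotonic()
        # Drop timestamps that have aged out of the window.
        while self._stamps and now - self._stamps[0] >= self.time_window:
            self._stamps.popleft()
        if len(self._stamps) >= self.max_requests:
            # Sleep until the oldest request exits the window, then retry.
            await asyncio.sleep(self.time_window - (now - self._stamps[0]))
            return await self.acquire()
        self._stamps.append(time.monotonic())
```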

📁 Project Structure

web-maestro/
├── src/web_maestro/
│   ├── __init__.py              # Main exports
│   ├── multi_provider.py        # WebMaestro orchestrator
│   ├── fetch.py                 # Core fetching logic
│   ├── context.py               # Session management
│   ├── providers/               # LLM provider implementations
│   │   ├── base.py              # Base provider interface
│   │   ├── portkey.py           # Portkey provider
│   │   ├── openai.py            # OpenAI provider
│   │   ├── anthropic.py         # Anthropic provider
│   │   └── ollama.py            # Ollama provider
│   ├── utils/                   # Utility classes
│   │   ├── enhanced_fetch.py    # Smart fetching
│   │   ├── rate_limiter.py      # Rate limiting
│   │   ├── text_processor.py    # Text processing
│   │   └── json_processor.py    # JSON handling
│   ├── models/                  # Data models
│   │   └── types.py             # Type definitions
│   └── dom_capture/             # Browser automation
│       ├── universal_capture.py # DOM interaction
│       └── scroll.py            # Scrolling logic
├── tests/                       # Test files
├── docs/                        # Documentation
├── pyproject.toml               # Project configuration
└── README.md                    # This file

📖 Examples

Real Website Extraction

Test with a real website (example using Chelsea FC):

import asyncio
from web_maestro import fetch_rendered_html, SessionContext
from web_maestro.providers.portkey import PortkeyProvider
from web_maestro import LLMConfig

async def extract_chelsea_info():
    # Configure provider
    config = LLMConfig(
        provider="portkey",
        api_key="your-key",
        model="gpt-4o",
        base_url="your-endpoint",
        extra_params={"virtual_key": "your-virtual-key"}
    )

    provider = PortkeyProvider(config)

    # Extract website content
    ctx = SessionContext()
    blocks = await fetch_rendered_html("https://www.chelseafc.com/en", ctx)

    if blocks:
        # Analyze with AI
        content = "\n".join([block.content for block in blocks[:50]])

        response = await provider.complete(f"""
        Extract soccer information from this Chelsea FC website:
        1. Latest news and match updates
        2. Upcoming fixtures
        3. Team news

        Website content:
        {content[:5000]}
        """)

        if response.success:
            print("⚽ Extracted Information:")
            print(response.content)

asyncio.run(extract_chelsea_info())

🤝 Contributing

We welcome contributions! See CONTRIBUTING.md for guidelines.

Quick Development Setup

# Clone the repository
git clone https://github.com/fede-dash/web-maestro.git
cd web-maestro

# Automated setup (recommended)
./scripts/setup-dev.sh

# Or manual setup
pip install hatch
hatch run install-dev
hatch run install-hooks

Development Commands

# Code quality (using Hatch - recommended)
hatch run format      # Format code with Black and Ruff
hatch run lint        # Run linting checks
hatch run check       # Run all quality checks

# Testing
hatch run test        # Run tests
hatch run test-cov    # Run tests with coverage

# Or use Make commands
make format           # Format code
make lint            # Run linting
make check           # Run all checks
make test            # Run tests
make dev-setup       # Full development setup

Pre-commit Hooks

Pre-commit hooks are mandatory and will run automatically:

# Install hooks (done automatically by setup script)
hatch run install-hooks

# Run hooks manually
pre-commit run --all-files

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

📈 Version History

v1.0.0 (Current)

  • Initial Release: Production-ready web content extraction
  • Multi-Provider LLM Support: OpenAI, Anthropic, Portkey, Ollama
  • Browser Automation: Full Playwright integration
  • Streaming Support: Real-time response streaming
  • Type Safety: Comprehensive type hints throughout
  • Session Management: Proper browser context handling

🚀 Coming Soon

  • v1.1.0: WebActions framework for intelligent web automation
  • v1.2.0: Advanced caching and rate limiting
  • v1.3.0: Plugin architecture and custom providers

🙏 Built With

  • Playwright: Browser automation framework
  • Beautiful Soup: HTML parsing library
  • aiohttp: Async HTTP client
  • Pydantic: Data validation and settings management

Web Maestro - Intelligent Web Content Extraction

⭐ Star us on GitHub | 📚 Documentation | 🐛 Report Issue
