cloudflare-peek

A Python utility for scraping Cloudflare-protected websites using screenshot + OCR fallback

0.1.0

CloudflarePeek 🔍

Made by Talha Ali

A Python utility for scraping websites, including those protected by Cloudflare. When traditional scraping fails, CloudflarePeek automatically falls back to taking a full-page screenshot and extracting the text with Google's Gemini OCR.

✨ Features

  • 🛡️ Cloudflare Detection: Automatically detects Cloudflare-protected sites
  • 📸 Screenshot Fallback: Takes full-page screenshots when traditional scraping fails
  • 🤖 AI-Powered OCR: Uses Google Gemini to extract clean text from screenshots
  • ⚡ Smart Switching: Tries fast scraping first, falls back to OCR only when needed
  • 🔄 Auto-scrolling: Scrolls pages to bottom to capture all content
  • 🎯 Zero Config: Works out of the box with minimal setup
  • ⚙️ Event Loop Safe: Automatically handles asyncio conflicts in Jupyter/existing loops

🚀 Installation

From GitHub

# Install in development mode
pip install -e git+https://github.com/Talha-Ali-5365/CloudflarePeek.git#egg=cloudflare-peek

# Or clone and install locally
git clone https://github.com/Talha-Ali-5365/CloudflarePeek.git
cd CloudflarePeek
pip install -e .

Additional Setup

  • Install Playwright browsers (required for screenshot functionality):
playwright install chromium
  • Get a Gemini API key (required for OCR):
    • Go to Google AI Studio
    • Create a new API key
    • Set it as an environment variable:
export GEMINI_API_KEY="your-gemini-api-key-here"
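
Before starting a long scraping run, it can help to confirm that a key is actually available. The helper below is a hypothetical sketch (the name `resolve_gemini_key` is not part of the package); it mirrors the documented behaviour of preferring an explicit key and otherwise reading GEMINI_API_KEY, and it assumes a missing key should raise ValueError.

```python
import os
from typing import Mapping, Optional


def resolve_gemini_key(
    explicit_key: Optional[str] = None,
    env: Optional[Mapping[str, str]] = None,
) -> str:
    """Return the explicit key if given, else the GEMINI_API_KEY variable."""
    env = os.environ if env is None else env
    key = explicit_key or env.get("GEMINI_API_KEY", "")
    if not key.strip():
        raise ValueError(
            "Gemini API key not found; set GEMINI_API_KEY or pass api_key"
        )
    return key


# Example with an in-memory environment instead of the real one.
print(resolve_gemini_key(env={"GEMINI_API_KEY": "demo-key"}))
```

The resolved value could then be passed along as `peek(url, api_key=...)`.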

📖 Quick Start

Basic Usage

from cloudflare_peek import peek

# Scrape any website - automatically handles Cloudflare
text = peek("https://example.com")
print(text)

Advanced Usage

from cloudflare_peek import peek, behind_cloudflare

# Check if a site is behind Cloudflare
if behind_cloudflare("https://example.com"):
    print("Site is protected by Cloudflare")

# Force OCR method (useful for dynamic content)
text = peek("https://example.com", force_ocr=True)

# Use with custom API key and timeout (5 minutes)
text = peek("https://example.com", api_key="your-gemini-key", timeout=300000)

CLI Usage

CloudflarePeek also comes with a powerful command-line interface.

Scrape a website:

cloudflare-peek scrape https://example.com

Check if a site is behind Cloudflare:

cloudflare-peek check-cloudflare https://example.com

Save content to a file:

cloudflare-peek scrape https://example.com -o content.txt

Advanced options:

# Force OCR, run in non-headless mode, and set a 60s timeout
cloudflare-peek scrape https://example.com --force-ocr --no-headless --timeout 60

# See all commands and options
cloudflare-peek --help
cloudflare-peek scrape --help

Environment Variables

# Required for OCR functionality
export GEMINI_API_KEY="your-gemini-api-key"

⏱️ Timeout Configuration

CloudflarePeek uses a default timeout of 2 minutes (120,000 ms) for page loading during OCR extraction. To change it, pass timeout in milliseconds:

# Quick timeout (30 seconds) for fast sites
text = peek("https://example.com", timeout=30000)

# Extended timeout (5 minutes) for slow/complex sites  
text = peek("https://example.com", timeout=300000)

# Very long timeout (10 minutes) for extremely slow sites
text = peek("https://example.com", timeout=600000)
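
Because the timeout is given in milliseconds, large literals such as 600000 are easy to mistype. One option is to build the value from a `timedelta`; the helper `to_timeout_ms` below is a hypothetical convenience, not something the package provides.

```python
from datetime import timedelta


def to_timeout_ms(duration: timedelta) -> int:
    """Convert a duration to whole milliseconds for a timeout argument."""
    ms = round(duration.total_seconds() * 1000)
    if ms <= 0:
        raise ValueError("timeout must be positive")
    return ms


QUICK = to_timeout_ms(timedelta(seconds=30))    # 30000
DEFAULT = to_timeout_ms(timedelta(minutes=2))   # 120000
EXTENDED = to_timeout_ms(timedelta(minutes=5))  # 300000
print(QUICK, DEFAULT, EXTENDED)
```

A call would then read `peek("https://example.com", timeout=EXTENDED)`.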

📋 Progress Logging

CloudflarePeek provides detailed progress information during scraping:

import logging
from cloudflare_peek import peek

# Enable detailed debug logging to see all steps
logging.getLogger('cloudflare_peek').setLevel(logging.DEBUG)

# You'll see progress like:
# 🎯 Starting CloudflarePeek for: https://example.com
# 🔍 Checking if https://example.com is behind Cloudflare...
# 🚀 No Cloudflare detected - attempting fast scraping...
# ✅ Fast scraping successful! (1234 characters extracted)

text = peek("https://example.com")
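
Setting a logger's level only controls which records are created; a handler is still needed to display them. If the package does not install its own handler (an assumption, since the README does not say), debug messages may not appear until one is configured. A minimal sketch using only the standard library:

```python
import logging

# Attach a console handler to the root logger. The root level stays at
# WARNING, so other libraries remain quiet.
logging.basicConfig(
    level=logging.WARNING,
    format="%(asctime)s %(levelname)s %(name)s: %(message)s",
)

# Lower the threshold only for the cloudflare_peek logger. Its records
# propagate to the root handler, which has no level filter of its own.
logger = logging.getLogger("cloudflare_peek")
logger.setLevel(logging.DEBUG)

logger.debug("debug logging is active")
```

Keeping the root logger at WARNING while lowering a single named logger avoids flooding the console with debug output from unrelated dependencies.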

🔧 Event Loop Compatibility

CloudflarePeek automatically handles asyncio event loop conflicts, so it works seamlessly in:

  • Jupyter Notebooks ✅
  • IPython environments ✅
  • Web frameworks (FastAPI, Django, etc.) ✅
  • Standalone scripts ✅

No need for nest_asyncio.apply() or other workarounds - it's all handled internally!
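
For readers curious how a synchronous function can drive async code whether or not a loop is already running, one common approach is sketched below. It is not necessarily how CloudflarePeek does it; `run_coroutine_sync` and `fetch_text` are hypothetical names used only for illustration.

```python
import asyncio
import threading
from typing import Any, Coroutine, Dict, TypeVar

T = TypeVar("T")


def run_coroutine_sync(coro: Coroutine[Any, Any, T]) -> T:
    """Run a coroutine to completion from synchronous code."""
    try:
        asyncio.get_running_loop()
    except RuntimeError:
        # No loop is running in this thread, so asyncio.run is safe.
        return asyncio.run(coro)

    # A loop is already running (for example in Jupyter): run the
    # coroutine in a separate thread with its own loop and wait for it.
    outcome: Dict[str, Any] = {}

    def worker() -> None:
        try:
            outcome["value"] = asyncio.run(coro)
        except BaseException as exc:
            outcome["error"] = exc

    thread = threading.Thread(target=worker)
    thread.start()
    thread.join()
    if "error" in outcome:
        raise outcome["error"]
    return outcome["value"]


async def fetch_text() -> str:
    await asyncio.sleep(0)
    return "done"


print(run_coroutine_sync(fetch_text()))
```

Note that in the second branch the calling loop is blocked until the worker thread finishes, which is acceptable for a blocking API but worth knowing inside a web framework.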

πŸ› οΈ API Reference

peek(url, api_key=None, force_ocr=False, timeout=120000)

The main function that intelligently chooses between fast scraping and OCR extraction.

Parameters:

  • url (str): The URL to scrape
  • api_key (str, optional): Gemini API key (uses GEMINI_API_KEY env var if not provided)
  • force_ocr (bool): Skip fast scraping and use OCR method directly
  • timeout (int): Page load timeout in milliseconds for OCR method (default: 120000 = 2 minutes)

Returns: Extracted text content as string

behind_cloudflare(url)

Check if a website is protected by Cloudflare.

Parameters:

  • url (str): The URL to check

Returns: True if behind Cloudflare, False otherwise
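
The README does not describe how detection works. As a rough illustration of one common heuristic, responses served through Cloudflare typically carry a `Server: cloudflare` header and a `CF-RAY` header. The function `looks_like_cloudflare` below is a hypothetical sketch of that check on an already-fetched header mapping, not the package's own logic.

```python
from typing import Mapping


def looks_like_cloudflare(headers: Mapping[str, str]) -> bool:
    """Heuristic: inspect response headers for Cloudflare markers."""
    # Header names are case-insensitive, so normalize before comparing.
    normalized = {name.lower(): value.lower() for name, value in headers.items()}
    return "cloudflare" in normalized.get("server", "") or "cf-ray" in normalized


print(looks_like_cloudflare({"Server": "cloudflare", "CF-RAY": "abc123-LHR"}))
print(looks_like_cloudflare({"Server": "nginx"}))
```

A header check like this only shows that traffic passes through Cloudflare; it does not tell you whether a challenge page will actually be served.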

πŸ“ Examples

Example 1: Basic Website Scraping

from cloudflare_peek import peek

# Works with any website
websites = [
    "https://httpbin.org/html",
    "https://quotes.toscrape.com",
    "https://scrapethissite.com"
]

for url in websites:
    content = peek(url)
    print(f"Content from {url}:")
    print(content[:200] + "...")
    print("-" * 50)

Example 2: Batch Processing URLs

import asyncio
from cloudflare_peek import peek

async def scrape_multiple(urls):
    """Scrape each URL in turn and map it to its text (None on failure)."""
    results = {}
    for url in urls:
        try:
            # peek() is a blocking call, so the URLs are processed one at a time
            content = peek(url)
            results[url] = content
            print(f"✅ Successfully scraped {url}")
        except Exception as e:
            print(f"❌ Failed to scrape {url}: {e}")
            results[url] = None
    return results

urls = ["https://example1.com", "https://example2.com"]
results = asyncio.run(scrape_multiple(urls))

Example 3: Error Handling

from cloudflare_peek import peek

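def safe_scrape(url):
    """Return the scraped text, or None if scraping fails."""
    try:
        return peek(url)
    except ValueError as e:
        if "API key" in str(e):
            print("❌ Gemini API key not found. Please set the GEMINI_API_KEY environment variable.")
        else:
            # Report other ValueErrors instead of failing silently
            print(f"❌ Invalid input: {e}")
        return None
    except Exception as e:
        print(f"❌ Scraping failed: {e}")
        return None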

content = safe_scrape("https://example.com")
if content:
    print("Scraping successful!")

🔧 Development

Setting up for Development

# Clone the repository
git clone https://github.com/your-username/CloudflarePeek.git
cd CloudflarePeek

# Install in development mode with dev dependencies
pip install -e ".[dev]"

# Install Playwright browsers
playwright install chromium

# Set up your API key
export GEMINI_API_KEY="your-key-here"

Running Tests

# Run tests
pytest

# Run tests with coverage
pytest --cov=cloudflare_peek

Code Formatting

# Format code
black cloudflare_peek/

# Check types
mypy cloudflare_peek/

🤝 Contributing

  • Fork the repository
  • Create a feature branch (git checkout -b feature/amazing-feature)
  • Make your changes
  • Add tests for new functionality
  • Commit your changes (git commit -m 'Add amazing feature')
  • Push to the branch (git push origin feature/amazing-feature)
  • Open a Pull Request

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

⚠️ Disclaimer

This tool is for educational purposes only. The author is not responsible for any misuse or damage caused by this tool.

This tool is intended for legitimate web scraping purposes only. Always respect websites' robots.txt files and terms of service. Be mindful of rate limiting and don't overload servers with requests.
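
As a small illustration of the robots.txt advice above, Python's standard library can evaluate robots.txt rules before a URL is scraped. The sketch below parses an inline sample rather than fetching a real file, and the user agent name `my-scraper` is made up for the example.

```python
from urllib.robotparser import RobotFileParser

# Sample rules; in practice these would come from the site's /robots.txt.
robots_txt = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

allowed_public = parser.can_fetch("my-scraper", "https://example.com/public/page")
allowed_private = parser.can_fetch("my-scraper", "https://example.com/private/page")
print(allowed_public, allowed_private)
```

For a live site, `RobotFileParser.set_url()` followed by `read()` loads the rules over the network instead.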

πŸ™ Acknowledgments

  • Playwright for browser automation
  • Google Gemini for OCR capabilities
  • LangChain for traditional web scraping
  • The open source community for inspiration and support

Keywords

scraping
