# Scrapme

A comprehensive web scraping framework featuring both static and dynamic content extraction, automatic Selenium/geckodriver management, rate limiting, proxy rotation, and Unicode support (including Georgian). Built with BeautifulSoup4 and Selenium, it provides an intuitive API for extracting text, tables, links, and more from any web source.
## Features
- 🚀 Simple and intuitive API
- 🔄 Support for JavaScript-rendered content using Selenium
- 🛠️ Automatic geckodriver management
- ⏱️ Built-in rate limiting
- 🔄 Proxy rotation with health tracking
- 📊 Automatic table parsing to Pandas DataFrames
- 🌐 Full Unicode support (including Georgian)
- 🧹 Clean text extraction
- 🎯 CSS selector support
- 🔍 Multiple content extraction methods
## Installation

```bash
pip install scrapme
```
## Quick Start

### Basic Usage (Static Content)
```python
from scrapme import WebScraper

# Initialize the static-content scraper
scraper = WebScraper()

# Extract all visible text from a page
text = scraper.get_text("https://example.com")
print(text)

# Extract all links with their anchor text
links = scraper.get_links("https://example.com")
for link in links:
    print(f"Text: {link['text']}, URL: {link['href']}")

# Parse all HTML tables into pandas DataFrames
tables = scraper.get_tables("https://example.com")
if tables:
    print(tables[0].head())
```
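Since `get_tables` returns standard pandas DataFrames, the usual pandas API applies directly; for example, exporting the first table to CSV (the filename here is arbitrary):

```python
# Plain pandas: write the first extracted table to disk
if tables:
    tables[0].to_csv("table.csv", index=False)
```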
### Dynamic Content (JavaScript-Rendered)
```python
from scrapme import SeleniumScraper

# Initialize a headless Firefox-backed scraper (geckodriver is managed automatically)
scraper = SeleniumScraper(headless=True)

# Extract text from a JavaScript-rendered page
text = scraper.get_text("https://example.com")
print(text)

# Run arbitrary JavaScript in the page context
title = scraper.execute_script("return document.title;")
print(f"Page title: {title}")

# Load more content on infinite-scroll pages
scraper.scroll_infinite(max_scrolls=5)
```
### Custom Geckodriver Path
```python
from scrapme import SeleniumScraper
import os

# Point the scraper at a specific geckodriver binary instead of the managed one
driver_path = os.getenv('GECKODRIVER_PATH', '/path/to/geckodriver')
scraper = SeleniumScraper(driver_path=driver_path)
```
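If no `driver_path` is given, Scrapme falls back to its automatic geckodriver management (see Features above), so an explicit path is only needed when a specific binary must be used.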
### Rate Limiting and Proxy Rotation
```python
from scrapme import WebScraper

# Proxies to rotate through (health tracking skips failing proxies)
proxies = [
    'http://proxy1.example.com:8080',
    'http://proxy2.example.com:8080'
]

# At most one request every two seconds, rotating across the proxy pool
scraper = WebScraper(
    requests_per_second=0.5,
    proxies=proxies
)

# Add a proxy and adjust the rate limit at runtime
scraper.add_proxy('http://proxy3.example.com:8080')
scraper.set_rate_limit(0.2)  # one request every five seconds
```
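With these settings in place, sequential requests are throttled and routed for you; a quick sketch with placeholder URLs:

```python
# Requests are spaced out per the rate limit and rotated across the proxy pool
for url in ["https://example.com/page1", "https://example.com/page2"]:
    print(scraper.get_text(url)[:100])  # first 100 characters of each page
```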
### Unicode Support (Including Georgian)
```python
from scrapme import WebScraper

# Request Georgian-language content and decode responses as UTF-8
scraper = WebScraper(
    headers={'Accept-Language': 'ka-GE,ka;q=0.9'},
    encoding='utf-8'
)

text = scraper.get_text("https://example.ge")
print(text)
```
## Advanced Features

### Content Selection Methods
```python
# CSS selectors
elements = scraper.find_by_selector("https://example.com", "div.content > p")

# Class name
elements = scraper.find_by_class("https://example.com", "main-content")

# Element ID
element = scraper.find_by_id("https://example.com", "header")

# Tag name
elements = scraper.find_by_tag("https://example.com", "article")
```
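Since Scrapme is built on BeautifulSoup4, these methods should return standard `bs4` elements (an assumption, not spelled out above), so the familiar Tag API applies:

```python
# Assumes find_by_* returns BeautifulSoup Tag objects
for el in elements:
    print(el.get_text(strip=True))
```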
### Selenium Wait Conditions
```python
from scrapme import SeleniumScraper

scraper = SeleniumScraper()
url = "https://example.com"

# Wait for an element to be present in the DOM before parsing
soup = scraper.get_soup(url, wait_for="#dynamic-content")

# Wait for an element to become visible instead
soup = scraper.get_soup(url, wait_for="#loading", wait_type="visibility")
```
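Assuming `get_soup` returns a regular `bs4.BeautifulSoup` document, it can be queried as usual once the wait condition is met:

```python
# Standard BeautifulSoup query on the parsed, fully rendered page
content = soup.select_one("#dynamic-content")
if content is not None:
    print(content.get_text(strip=True))
```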
### Error Handling
The package provides custom exceptions for better error handling:
```python
from scrapme import ScraperException, RequestException, ParsingException

try:
    scraper.get_text("https://example.com")
except RequestException as e:
    print(f"Failed to fetch content: {e}")
except ParsingException as e:
    print(f"Failed to parse content: {e}")
except ScraperException as e:
    print(f"General scraping error: {e}")
```
## Best Practices
- **Rate Limiting:** Always use rate limiting to avoid overwhelming servers:

  ```python
  scraper = WebScraper(requests_per_second=0.5)
  ```

- **Proxy Rotation:** For large-scale scraping, rotate through multiple proxies:

  ```python
  scraper = WebScraper(proxies=['proxy1', 'proxy2', 'proxy3'])
  ```

- **Resource Management:** Use context managers or clean up Selenium resources explicitly when you are done:

  ```python
  scraper = SeleniumScraper()
  try:
      ...  # scraping operations
  finally:
      del scraper  # triggers cleanup of the browser session
  ```

- **Error Handling:** Always implement proper error handling:

  ```python
  try:
      scraper.get_text(url)
  except ScraperException as e:
      logging.error(f"Scraping failed: {e}")
  ```
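Putting these together, a minimal end-to-end sketch might look like this (the URLs are placeholders; only calls shown earlier in this README are used):

```python
import logging
from scrapme import WebScraper, ScraperException

logging.basicConfig(level=logging.INFO)

# Throttled scraper: at most one request every two seconds
scraper = WebScraper(requests_per_second=0.5)

for url in ["https://example.com/a", "https://example.com/b"]:  # placeholder URLs
    try:
        text = scraper.get_text(url)
        logging.info("Fetched %d characters from %s", len(text), url)
    except ScraperException as e:
        logging.error("Scraping failed for %s: %s", url, e)
```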
## License
This project is licensed under the MIT License - see the LICENSE file for details.
## Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
## Support
For support, please open an issue on the GitHub repository or contact info@ubix.pro.