
SmartScrapingAgent is a Python package designed to simplify web scraping using state-of-the-art LLMs (Large Language Models) and customizable schemas. With this package, you can extract structured data efficiently from large, complex web pages.
Install the Playwright browser binaries, which are required for headless browsing:
playwright install
Install the Smart Scraping Agent package:
pip install ai-tech-crawler
Here is a step-by-step guide to using the Smart Scraping Agent package:
N.B.: This step is required only in Jupyter notebooks, since they run their own event loop:
pip install nest-asyncio
import nest_asyncio
nest_asyncio.apply()
Import necessary modules and classes:
import os, json
from dotenv import load_dotenv
load_dotenv()
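`load_dotenv()` reads environment variables from a `.env` file in the working directory. Assuming your OpenAI key is stored there, the file might look like this (the key value is a placeholder):

```
OPENAI_API_KEY=sk-...
```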
Set up the configuration for the scraping pipeline:
agent_config = {
    "llm": {
        "api_key": os.getenv('OPENAI_API_KEY'),
        "model": "openai/gpt-4o-mini",
        # Uncomment for other models
        # "model": "ollama/nemotron-mini",
        # "device": "cuda",
        # "model_kwargs": {'response_format': {"type": "json_object"}}
    },
    "verbose": True,
    "headless": True,
    "max_retries": 3
}
Define a simple prompt to guide the scraping process:
simple_prompt = """
Extract all the trending topics, their search volumes, when it started trending and the trend breakdown from the website's content.
"""
Use a schema to define the structure of the extracted data. The schema can be any of the following:
a format-instruction string with examples
a dict
JSON
a Pydantic model (a subclass of BaseModel)
schema_ = {
    'trends': [
        {
            'topic': 'Trending topic',
            'search_volume': 'Search Volume of a topic',
            'started': 'Time when it started trending',
            'trend_breakdown': 'A trend may consist of multiple queries that are variants of the same search or considered to be related. Trend breakdown details these queries.'
        }
    ],
    'other_links': [
        'list of any other reference URLs'
    ]
}
N.B.: For better results, use a valid Pydantic schema, i.e. a subclass of BaseModel.
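A Pydantic equivalent of the dict schema above might look like the following sketch. The model and field names mirror the dict keys; the `Field` descriptions are assumptions based on the dict's placeholder strings, not part of the package's API:

```python
from typing import List
from pydantic import BaseModel, Field

class Trend(BaseModel):
    topic: str = Field(description="Trending topic")
    search_volume: str = Field(description="Search volume of the topic")
    started: str = Field(description="Time when it started trending")
    trend_breakdown: str = Field(
        description="Related query variants grouped under this trend"
    )

class TrendsSchema(BaseModel):
    trends: List[Trend]
    other_links: List[str] = Field(description="Any other reference URLs")
```

Such a model could then be passed as the `schema` argument in place of `schema_`.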
Create an instance of the SmartScraperAgent with the necessary parameters:
from ai_tech_crawler import SmartScraperAgent
url = "https://trends.google.com/trending"
smart_scraper_agent = SmartScraperAgent(
    prompt=simple_prompt,
    source=url,
    config=agent_config,
    schema=schema_
)
Execute the scraping pipeline and process the results:
result = smart_scraper_agent.run()
print(json.dumps(result, indent=4))
markdown_content = smart_scraper_agent.get_markdown()
print(markdown_content)
documents = smart_scraper_agent.load_indepth_and_split(depth=2)
print(documents[0].page_content)
print(documents[0].metadata)
Contributions are welcome! Feel free to submit issues or pull requests on the GitHub repository.
This project is licensed under the MIT License. See the LICENSE file for details.