ai-tech-crawler

Autonomous Scraping Agent: Scrape URLs with prompts and schema

Version 0.0.2 (PyPI)

SmartScrapingAgent

SmartScrapingAgent is a Python package that simplifies web scraping using state-of-the-art large language models (LLMs) and customizable schemas. With this package, you can efficiently extract structured data from large, complex web pages.

Installation

Step 1: Install Playwright

Install the Playwright browser binaries, which are required for headless browsing (run pip install playwright first if the playwright CLI is not yet available):

playwright install

Step 2: Install SmartScrapingAgent

Install the Smart Scraping Agent package:

pip install ai-tech-crawler

Usage

Here is a step-by-step guide to using the Smart Scraping Agent package:

N.B.: This step is required only in Jupyter notebooks, since they run their own event loop:

pip install nest-asyncio

import nest_asyncio

nest_asyncio.apply()

Step 1: Import Required Modules

Import necessary modules and classes:

import os, json
from dotenv import load_dotenv

# Read environment variables (e.g. OPENAI_API_KEY) from a local .env file
load_dotenv()
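
load_dotenv() reads your API key from a .env file in the working directory; for this example the file needs a single entry (the value shown is a placeholder):

OPENAI_API_KEY=<your-openai-api-key>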

Step 2: Define the Configuration

Set up the configuration for the scraping pipeline:

agent_config = {
    "llm": {
        "api_key": os.getenv('OPENAI_API_KEY'),
        "model": "openai/gpt-4o-mini",
        # Uncomment for other models
        # "model": "ollama/nemotron-mini",
        # "device": "cuda",
        # "model_kwargs": {'response_format': {"type": "json_object"}}
    },
    "verbose": True,
    "headless": True,
    "max_retries": 3
}
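
The commented-out options above point to running a local model instead of OpenAI. A sketch of such a configuration, assuming the same option names apply (the model, device, and model_kwargs values are taken from the comments; local_agent_config is just an illustrative name):

local_agent_config = {
    "llm": {
        "model": "ollama/nemotron-mini",  # local model served by Ollama; no API key needed
        "device": "cuda",                 # run inference on the GPU
        # ask the model to emit a JSON object
        "model_kwargs": {"response_format": {"type": "json_object"}},
    },
    "verbose": True,
    "headless": True,
    "max_retries": 3,
}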

Step 3: Write Your Prompt

Define a simple prompt to guide the scraping process:

simple_prompt = """
Extract all the trending topics, their search volumes, when they started trending, and the trend breakdown from the website's content.
"""

Step 4: Load the Schema

Use a schema to define the structure of the extracted data. The schema can be:

  • a format-instruction string with examples
  • a dict
  • a JSON string
  • a Pydantic model (a subclass of BaseModel)

schema_ = {
    'trends': [
        {
            'topic': 'Trending topic',
            'search_volume': 'Search Volume of a topic',
            'started': 'Time when it started trending',
            'trend_breakdown': 'A trend may consist of multiple queries that are variants of the same search or considered to be related. Trend breakdown details these queries.'
         }
    ],
    'other_links':[
        'list of any other reference URLs'
    ]
}

N.B.: For better results, use a valid Pydantic schema (a subclass of BaseModel).
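
A minimal sketch of an equivalent Pydantic schema (the class names Trend and TrendReport are illustrative; the fields mirror the dict schema above):

from typing import List
from pydantic import BaseModel, Field

class Trend(BaseModel):
    topic: str = Field(description="Trending topic")
    search_volume: str = Field(description="Search volume of the topic")
    started: str = Field(description="Time when it started trending")
    trend_breakdown: str = Field(description="Related query variants grouped under this trend")

class TrendReport(BaseModel):
    trends: List[Trend]
    other_links: List[str] = Field(description="List of any other reference URLs")

The class itself would then presumably be passed as schema=TrendReport in place of the dict above.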

Step 5: Instantiate the SmartScraperAgent

Create an instance of the SmartScraperAgent with the necessary parameters:

from ai_tech_crawler import SmartScraperAgent

url = "https://trends.google.com/trending"

smart_scraper_agent = SmartScraperAgent(
    prompt=simple_prompt,
    source=url,
    config=agent_config,
    schema=schema_
)

Step 6: Run the Scraper

Execute the scraping pipeline and process the results:

result = smart_scraper_agent.run()
print(json.dumps(result, indent=4))
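
Assuming the call returns a dict shaped like schema_ above (the keys below come from that schema, not from a documented return type), the result can be post-processed with plain Python:

# Hypothetical post-processing; assumes `result` follows `schema_`
for trend in result.get("trends", []):
    print(f"{trend.get('topic')}: {trend.get('search_volume')} (started {trend.get('started')})")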

Load webpage content as Markdown

markdown_content = smart_scraper_agent.get_markdown()
print(markdown_content)

Recursively load webpages and split them into Documents

documents = smart_scraper_agent.load_indepth_and_split(depth=2)
print(documents[0].page_content)
print(documents[0].metadata)
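
The returned items appear to be LangChain-style Document objects exposing page_content and metadata (an assumption based on the attribute names above); a short sketch of summarizing them:

# Hypothetical summary loop; assumes Document objects with `page_content` and `metadata`
for doc in documents:
    source = doc.metadata.get("source", "<unknown source>")
    print(f"{source}: {len(doc.page_content)} characters")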

Key Features

  • LLM-Powered: Leverage advanced models like GPT for smart data extraction.
  • Schema-Driven: Flexible schema design to control output format.
  • Headless Browsing: Playwright integration for efficient, non-visual browsing.
  • Customizable: Fine-tune the pipeline using configurations and custom merge methods.

Contributing

Contributions are welcome! Feel free to submit issues or pull requests on the GitHub repository.

License

This project is licensed under the MIT License. See the LICENSE file for details.
