ai-tech-crawler

Autonomous Scraping Agent: Scrape URLs with prompts and schema

Version 0.0.2 (PyPI)

SmartScrapingAgent

SmartScrapingAgent is a Python package that simplifies web scraping using state-of-the-art large language models (LLMs) and customizable schemas. With this package, you can efficiently extract structured data from large, complex web pages.

Installation

Step 1: Install Playwright

Install the Playwright browser binaries, which are required for headless browsing (run pip install playwright first if the playwright CLI is not yet available):

playwright install

Step 2: Install SmartScrapingAgent

Install the Smart Scraping Agent package:

pip install ai-tech-crawler

Usage

Here is a step-by-step guide to using the Smart Scraping Agent package:

N.B.: This step is required only in Jupyter notebooks, since they run their own event loop:

pip install nest-asyncio

import nest_asyncio

nest_asyncio.apply()

Step 1: Import Required Modules

Import necessary modules and classes:

import os, json
from dotenv import load_dotenv

# Read environment variables (e.g. OPENAI_API_KEY) from a local .env file
load_dotenv()
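
load_dotenv() reads your API key from a .env file in the working directory; for this example the file needs a single entry (the value shown is a placeholder):

OPENAI_API_KEY=<your-openai-api-key>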

Step 2: Define the Configuration

Set up the configuration for the scraping pipeline:

agent_config = {
    "llm": {
        "api_key": os.getenv('OPENAI_API_KEY'),
        "model": "openai/gpt-4o-mini",
        # Uncomment for other models
        # "model": "ollama/nemotron-mini",
        # "device": "cuda",
        # "model_kwargs": {'response_format': {"type": "json_object"}}
    },
    "verbose": True,
    "headless": True,
    "max_retries": 3
}
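
The commented-out options above point to running a local model instead of OpenAI. A sketch of such a configuration, assuming the same option names apply (the model, device, and model_kwargs values are taken from the comments; local_agent_config is just an illustrative name):

local_agent_config = {
    "llm": {
        "model": "ollama/nemotron-mini",  # local model served by Ollama; no API key needed
        "device": "cuda",                 # run inference on the GPU
        # ask the model to emit a JSON object
        "model_kwargs": {"response_format": {"type": "json_object"}},
    },
    "verbose": True,
    "headless": True,
    "max_retries": 3,
}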

Step 3: Write Your Prompt

Define a simple prompt to guide the scraping process:

simple_prompt = """
Extract all the trending topics, their search volumes, when they started trending, and the trend breakdown from the website's content.
"""

Step 4: Load the Schema

Use a schema to define the structure of the extracted data. The schema can be:

  • a format-instruction string with examples
  • a dict
  • a JSON string
  • a Pydantic model (a subclass of BaseModel)

schema_ = {
    'trends': [
        {
            'topic': 'Trending topic',
            'search_volume': 'Search Volume of a topic',
            'started': 'Time when it started trending',
            'trend_breakdown': 'A trend may consist of multiple queries that are variants of the same search or considered to be related. Trend breakdown details these queries.'
         }
    ],
    'other_links':[
        'list of any other reference URLs'
    ]
}

N.B.: For better results, use a valid Pydantic schema (a subclass of BaseModel).
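
A minimal sketch of an equivalent Pydantic schema (the class names Trend and TrendReport are illustrative; the fields mirror the dict schema above):

from typing import List
from pydantic import BaseModel, Field

class Trend(BaseModel):
    topic: str = Field(description="Trending topic")
    search_volume: str = Field(description="Search volume of the topic")
    started: str = Field(description="Time when it started trending")
    trend_breakdown: str = Field(description="Related query variants grouped under this trend")

class TrendReport(BaseModel):
    trends: List[Trend]
    other_links: List[str] = Field(description="List of any other reference URLs")

The class itself would then presumably be passed as schema=TrendReport in place of the dict above.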

Step 5: Instantiate the SmartScraperAgent

Create an instance of the SmartScraperAgent with the necessary parameters:

from ai_tech_crawler import SmartScraperAgent

url = "https://trends.google.com/trending"

smart_scraper_agent = SmartScraperAgent(
    prompt=simple_prompt,
    source=url,
    config=agent_config,
    schema=schema_
)

Step 6: Run the Scraper

Execute the scraping pipeline and process the results:

result = smart_scraper_agent.run()
print(json.dumps(result, indent=4))
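
Assuming the call returns a dict shaped like schema_ above (the keys below come from that schema, not from a documented return type), the result can be post-processed with plain Python:

# Hypothetical post-processing; assumes `result` follows `schema_`
for trend in result.get("trends", []):
    print(f"{trend.get('topic')}: {trend.get('search_volume')} (started {trend.get('started')})")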

Load webpage content as Markdown

markdown_content = smart_scraper_agent.get_markdown()
print(markdown_content)

Recursively load webpages and split them into Documents

documents = smart_scraper_agent.load_indepth_and_split(depth=2)
print(documents[0].page_content)
print(documents[0].metadata)
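
The returned items appear to be LangChain-style Document objects exposing page_content and metadata (an assumption based on the attribute names above); a short sketch of summarizing them:

# Hypothetical summary loop; assumes Document objects with `page_content` and `metadata`
for doc in documents:
    source = doc.metadata.get("source", "<unknown source>")
    print(f"{source}: {len(doc.page_content)} characters")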

Key Features

  • LLM-Powered: Leverage advanced models like GPT for smart data extraction.
  • Schema-Driven: Flexible schema design to control output format.
  • Headless Browsing: Playwright integration for efficient, non-visual browsing.
  • Customizable: Fine-tune the pipeline using configurations and custom merge methods.

Contributing

Contributions are welcome! Feel free to submit issues or pull requests on the GitHub repository.

License

This project is licensed under the MIT License. See the LICENSE file for details.
