ai-html-parse

AI HTML Parser

Version: 0.1.0 (PyPI)
Maintainers: 1

AI-Powered HTML Parser

Installation

Prerequisites

  • Python 3.8 or higher
  • Required Libraries:
    • requests
    • bs4 (BeautifulSoup)

Steps

  • Clone the repository:
    git clone https://github.com/pythonshik/ai-html-parser.git
    cd ai-html-parser
    
  • Install dependencies:
    pip install -r requirements.txt
    
  • Set up your API key for Google Gemini:
    • Create a folder named AI in the root directory.
    • Add your API key to a file named gemini_api_key inside the AI folder.
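
The key file from the last step can also be created from Python. This is a minimal sketch, assuming the key is stored as plain text in AI/gemini_api_key relative to the project root. The helper names save_api_key and load_api_key are hypothetical and are not part of the package.

```python
from pathlib import Path


def save_api_key(api_key: str, root: Path = Path(".")) -> Path:
    """Write the key to <root>/AI/gemini_api_key and return the file path."""
    key_dir = root / "AI"
    key_dir.mkdir(parents=True, exist_ok=True)  # create the AI folder if missing
    key_file = key_dir / "gemini_api_key"
    key_file.write_text(api_key.strip(), encoding="utf-8")
    return key_file


def load_api_key(root: Path = Path(".")) -> str:
    """Read the key back, stripping stray whitespace and newlines."""
    return (root / "AI" / "gemini_api_key").read_text(encoding="utf-8").strip()
```

Because the file holds a secret, it is usually worth excluding the AI folder from version control (for example through .gitignore).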

Usage

Example

  • Import the AIparser class:
    from AIparse import AIparser
    
  • Initialize the parser with a URL:
    element = AIparser("https://www.youtube.com/@PythonShik")
    
  • Parse specific elements:
    for i in ["number of videos", "number of subscribers"]:
        parsed_data = element.parse(i)
        print(f"{parsed_data['explain']}: {parsed_data['value']}")
    
  • Output example:
    {
      "value": "96",
      "explain": "Number of subscribers",
      "result": "96 subscribers"
    }
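
Assuming the result can be decoded into a dictionary with the three keys shown above, the fields could be consumed as in the sketch below. The sample string is copied from the output example. Converting value to an integer is an illustrative choice, not something the package is known to do.

```python
import json

# Sample response copied from the output example above.
raw = '{"value": "96", "explain": "Number of subscribers", "result": "96 subscribers"}'

parsed_data = json.loads(raw)

# "value" arrives as a string; convert it only when it is purely numeric.
value = parsed_data["value"]
count = int(value) if value.isdigit() else None

print(f"{parsed_data['explain']}: {count}")  # prints "Number of subscribers: 96"
```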
    

Overview

This project is an AI-powered HTML parser designed to extract specific data from web pages using Google Gemini's text generation API. The parser processes the HTML source code of a webpage, identifies specific elements, and returns the desired information in a structured JSON format.

Key Features

  • AI Integration: Utilizes Google Gemini for intelligent text analysis.
  • HTML Parsing: Extracts and processes HTML elements using BeautifulSoup.
  • Customizable Instructions: Supports user-defined parsing instructions.
  • JSON Output: Provides clear and structured results in JSON format.

How It Works

  • User Input: Provide a URL and the target element to parse.
  • HTML Fetching: The tool fetches the HTML source code of the webpage.
  • AI Analysis: The HTML source and target element are sent to the AI for processing.
  • JSON Output: The AI generates a structured response containing the extracted information.
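
The four steps above can be sketched as a small pipeline. This is an illustration under stated assumptions, not the package's actual code: the function names, the prompt wording, and the JSON extraction strategy are all hypothetical, and the page fetch and the Gemini call are replaced with offline stand-ins so the sketch runs without a network connection or an API key.

```python
import json
from typing import Callable, Dict


def build_prompt(html: str, element: str) -> str:
    """Combine an instruction, the target element, and the page source."""
    return (
        "Find the requested element in the HTML below and answer with a JSON "
        'object containing the keys "value", "explain" and "result".\n'
        f"Element: {element}\n"
        f"HTML:\n{html}"
    )


def extract_json(reply: str) -> Dict[str, str]:
    """Decode the text between the first '{' and the last '}' of the reply."""
    start, end = reply.find("{"), reply.rfind("}")
    if start == -1 or end < start:
        raise ValueError("model reply contains no JSON object")
    return json.loads(reply[start : end + 1])


def parse_element(
    url: str,
    element: str,
    fetch: Callable[[str], str],
    ask_model: Callable[[str], str],
) -> Dict[str, str]:
    """Run the steps in order: fetch HTML, ask the model, decode the JSON."""
    html = fetch(url)                     # HTML fetching
    prompt = build_prompt(html, element)  # combine source and target element
    reply = ask_model(prompt)             # AI analysis
    return extract_json(reply)            # JSON output


# Offline stand-ins (hypothetical). A real run would download the page,
# e.g. with requests.get(url).text, and send the prompt to the Gemini API.
def fake_fetch(url: str) -> str:
    return "<html><body><span id='subs'>96 subscribers</span></body></html>"


def fake_model(prompt: str) -> str:
    return (
        'Here is the result: {"value": "96", '
        '"explain": "Number of subscribers", "result": "96 subscribers"}'
    )


result = parse_element(
    "https://example.com", "number of subscribers", fake_fetch, fake_model
)
print(result["value"])  # prints "96"
```

Passing the fetch and model functions in as arguments keeps the slow, network-bound parts replaceable, which makes the flow easy to test in isolation.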

File Descriptions

1. BASE.py

The core class for interacting with Google Gemini's text generation API.

  • Features:
    • API key management.
    • Methods for adding and managing conversation history.
    • Text generation using the generate() method.
  • Key Methods:
    • history_add(role, content): Adds messages to the conversation history.
    • generate(): Sends data to the Gemini API and retrieves the generated text.
    • export_history(filename): Saves conversation history to a file.
    • import_history(filename): Loads conversation history from a file.
    • clear_history(filename): Clears the conversation history.
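
To make the history methods concrete, here is a minimal sketch of how such a conversation store could behave. It is a hypothetical stand-in, not the class from BASE.py: the class name History, the role/content dictionary layout, and the JSON file format are assumptions. The sketch's clear_history only empties the in-memory list; what the package does with its filename argument is not described here, so that parameter is left out.

```python
import json
from typing import Dict, List


class History:
    """Hypothetical stand-in for the history handling described above."""

    def __init__(self) -> None:
        self.history: List[Dict[str, str]] = []

    def history_add(self, role: str, content: str) -> None:
        """Append one message to the conversation history."""
        self.history.append({"role": role, "content": content})

    def export_history(self, filename: str) -> None:
        """Save the history to a JSON file (assumed format)."""
        with open(filename, "w", encoding="utf-8") as fh:
            json.dump(self.history, fh, ensure_ascii=False, indent=2)

    def import_history(self, filename: str) -> None:
        """Replace the current history with the contents of a JSON file."""
        with open(filename, "r", encoding="utf-8") as fh:
            self.history = json.load(fh)

    def clear_history(self) -> None:
        """Forget all messages held in memory."""
        self.history.clear()
```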

2. prompts.py

Defines the instruction format for AI tasks.

  • Key Class: Instructions
    • first_instruction: Provides a detailed guide for parsing HTML elements and formatting the response.

3. main.py

The main entry point for the application.

  • Features:
    • Manages the parsing process using AIparser.
    • Configures and interacts with the Gen class for AI communication.
    • Outputs results for specific elements like "number of subscribers" or "number of videos".
  • Key Methods:
    • AIparser.__init__: Initializes the parser with a URL; in the usage example the target element is passed to parse() rather than to the constructor.
    • AIparser.parse(element): Parses the given element and retrieves AI-generated results.

Target Audience

This tool is ideal for:

  • Marketers and Analysts: For monitoring trends, gathering competitor data, and extracting insights.
  • Small and Medium Businesses: To automate tasks like market monitoring or customer review aggregation.
  • SEO Specialists: To analyze site content, keywords, and metadata.
  • Developers and Freelancers: To speed up the execution of client parsing tasks.
  • Journalists and Bloggers: To gather data for articles and posts effortlessly.

Limitations

  • Speed: Processing can take up to 45 seconds because of the AI generation step.
  • Dependencies: Requires an active internet connection and a valid API key.
  • Scalability: Not optimized for high-frequency requests.

Potential Use Cases

  • Monitoring changes on web pages.
  • Extracting market research data.
  • Analyzing competitors' content.
  • Automating reporting tasks.

Future Improvements

  • Optimize performance with batch processing and caching.
  • Add support for local AI models to reduce dependency on external APIs.
  • Expand parsing capabilities to include other data formats like JSON and XML.
  • Develop a user-friendly interface (e.g., Telegram bot or web app).
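
The caching idea from the first item could look roughly like the sketch below. It is one possible approach, not a planned design: the CachedParser class and the slow_parse stand-in are hypothetical, and results are simply memoized per (url, element) pair so that a repeated request skips the slow AI call.

```python
from typing import Callable, Dict, Tuple


class CachedParser:
    """Memoize results per (url, element) so repeats skip the slow call."""

    def __init__(self, parse_fn: Callable[[str, str], dict]) -> None:
        self._parse_fn = parse_fn
        self._cache: Dict[Tuple[str, str], dict] = {}
        self.calls = 0  # how many times the slow function actually ran

    def parse(self, url: str, element: str) -> dict:
        key = (url, element)
        if key not in self._cache:
            self.calls += 1
            self._cache[key] = self._parse_fn(url, element)
        return self._cache[key]


# Hypothetical stand-in for a slow, AI-backed parse call.
def slow_parse(url: str, element: str) -> dict:
    return {"value": "96", "explain": element, "result": "96 subscribers"}


parser = CachedParser(slow_parse)
first = parser.parse("https://example.com", "number of subscribers")
second = parser.parse("https://example.com", "number of subscribers")
print(parser.calls)  # prints 1: the second request was served from the cache
```

A cache like this never expires, so for the page-monitoring use case it would need an expiry time or a manual reset to notice changes.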

Contributing

Feel free to contribute to the project by submitting issues or pull requests.

License

This project is licensed under the MIT License.
