markdrop

A comprehensive PDF processing toolkit that converts PDFs to markdown with advanced AI-powered features for image and table analysis. Supports local files and URLs, preserves document structure, extracts high-quality images, detects tables using advanced ML models, and generates detailed content descriptions using multiple LLM providers including OpenAI and Google's Gemini.

0.3.1.3
PyPI

Maintainers: 1

MARKDROP

A Python package that converts PDFs (local files or URLs) to markdown, extracts their images and tables, and can generate AI-powered descriptions of that content.

Features

Installation

pip install markdrop

https://pypi.org/project/markdrop

Quick Start

Basic PDF Processing

from markdrop import extract_images, make_markdown, extract_tables_from_pdf

source_pdf = 'url/or/path/to/pdf/file'    # Replace with your local PDF file path or a URL
output_dir = 'data/output'                 # Replace with desired output directory's path

make_markdown(source_pdf, output_dir)
extract_images(source_pdf, output_dir)
extract_tables_from_pdf(source_pdf, output_dir=output_dir)

Advanced PDF Processing with MarkDrop

from markdrop import markdrop, MarkDropConfig, add_downloadable_tables
from pathlib import Path
import logging

# Configure processing options
config = MarkDropConfig(
    image_resolution_scale=2.0,        # Scale factor for image resolution
    download_button_color='#444444',   # Color for download buttons in HTML
    log_level=logging.INFO,           # Logging detail level
    log_dir='logs',                   # Directory for log files
    excel_dir='markdropped-excel-tables'  # Directory for Excel table exports
)

# Process PDF document
input_doc_path = "path/to/input.pdf"
output_dir = Path('output_directory')

# Convert PDF and generate HTML with images and tables
html_path = markdrop(input_doc_path, output_dir, config)

# Add interactive table download functionality
downloadable_html = add_downloadable_tables(html_path, config)

AI-Powered Content Analysis

from markdrop import setup_keys, process_markdown, ProcessorConfig, AIProvider
from pathlib import Path

# Set up API keys for AI providers
setup_keys(key='gemini')  # or setup_keys(key='openai')

# Configure AI processing options
config = ProcessorConfig(
    input_path="path/to/markdown/file.md",    # Input markdown file path
    output_dir=Path("output_directory"),      # Output directory
    ai_provider=AIProvider.GEMINI,            # AI provider (GEMINI or OPENAI)
    remove_images=False,                      # Keep or remove original images
    remove_tables=False,                      # Keep or remove original tables
    table_descriptions=True,                  # Generate table descriptions
    image_descriptions=True,                  # Generate image descriptions
    max_retries=3,                           # Number of API call retries
    retry_delay=2,                           # Delay between retries in seconds
    gemini_model_name="gemini-1.5-flash",    # Gemini model for images
    gemini_text_model_name="gemini-pro",     # Gemini model for text
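    image_prompt="Describe this image in detail.",    # Custom prompt for image analysis (example text; replace with your own)
    table_prompt="Summarize the contents of this table."  # Custom prompt for table analysis (example text; replace with your own)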
)

# Process markdown with AI descriptions
output_path = process_markdown(config)

Image Description Generation

from markdrop import generate_descriptions

prompt = "Give textual highly detailed descriptions from this image ONLY, nothing else."
input_path = 'path/to/img_file/or/dir'
output_dir = 'data/output'
llm_clients = ['gemini', 'llama-vision']  # Available: ['qwen', 'gemini', 'openai', 'llama-vision', 'molmo', 'pixtral']

generate_descriptions(
    input_path=input_path,
    output_dir=output_dir,
    prompt=prompt,
    llm_client=llm_clients
)
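Because `llm_client` takes a list of client names, it can help to check the names before starting a long run. The following is a minimal sketch, assuming the names in the "Available" comment above are the full set. `validate_clients` is a hypothetical helper, not part of markdrop.

```python
# Client names copied from the "Available" comment in the example above.
AVAILABLE_CLIENTS = {"qwen", "gemini", "openai", "llama-vision", "molmo", "pixtral"}


def validate_clients(llm_clients):
    """Return a copy of the list, or raise ValueError naming unknown clients."""
    unknown = [name for name in llm_clients if name not in AVAILABLE_CLIENTS]
    if unknown:
        raise ValueError(f"Unknown LLM client(s): {', '.join(unknown)}")
    return list(llm_clients)
```

The validated list can then be passed as the `llm_client` argument of `generate_descriptions`.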

API Reference

Core Functions

markdrop(input_doc_path: str, output_dir: str, config: Optional[MarkDropConfig] = None) -> Path

Converts a PDF to markdown and HTML with embedded images and tables, and returns the path to the generated HTML file.

Parameters:

input_doc_path (str): Path to input PDF file
output_dir (str): Output directory path
config (MarkDropConfig, optional): Configuration options for processing

add_downloadable_tables(html_path: Path, config: Optional[MarkDropConfig] = None) -> Path

Adds interactive table download functionality to HTML output.

Parameters:

html_path (Path): Path to HTML file
config (MarkDropConfig, optional): Configuration options

Configuration Classes

MarkDropConfig

Configuration for PDF processing:

image_resolution_scale (float): Scale factor for image resolution (default: 2.0)
download_button_color (str): HTML color code for download buttons (default: '#444444')
log_level (int): Logging level (default: logging.INFO)
log_dir (str): Directory for log files (default: 'logs')
excel_dir (str): Directory for Excel table exports (default: 'markdropped-excel-tables')
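To make the defaults concrete, the fields above can be pictured as a dataclass. This is an illustrative stand-in, not markdrop's actual class definition: `ConfigSketch` is a hypothetical name, and only the field names and default values are taken from the list above.

```python
import logging
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class ConfigSketch:
    """Hypothetical mirror of the documented MarkDropConfig fields and defaults."""
    image_resolution_scale: float = 2.0
    download_button_color: str = "#444444"
    log_level: int = logging.INFO
    log_dir: str = "logs"
    excel_dir: str = "markdropped-excel-tables"


# All defaults, then a copy that overrides a single field.
defaults = ConfigSketch()
high_res = replace(defaults, image_resolution_scale=3.0)
```

In the same way, with the real `MarkDropConfig` only the fields that differ from the defaults should need to be passed.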

ProcessorConfig

Configuration for AI processing:

input_path (str): Path to markdown file
output_dir (str): Output directory path
ai_provider (AIProvider): AI provider selection (GEMINI or OPENAI)
remove_images (bool): Whether to remove original images
remove_tables (bool): Whether to remove original tables
table_descriptions (bool): Generate table descriptions
image_descriptions (bool): Generate image descriptions
max_retries (int): Maximum API call retries
retry_delay (int): Delay between retries in seconds
gemini_model_name (str): Gemini model for image processing
gemini_text_model_name (str): Gemini model for text processing
image_prompt (str): Custom prompt for image analysis
table_prompt (str): Custom prompt for table analysis
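The `max_retries` and `retry_delay` fields describe a retry policy for API calls. How markdrop applies them internally is not documented here, so the following is only a generic sketch of a fixed-delay retry loop. `call_with_retries` is a hypothetical helper, and it assumes `max_retries` counts total attempts.

```python
import time


def call_with_retries(func, max_retries=3, retry_delay=2, sleep=time.sleep):
    """Call func(); after a failure, wait retry_delay seconds and try again.

    Assumption: max_retries is the total number of attempts, not the number
    of retries after the first call. The library may count differently.
    """
    if max_retries < 1:
        raise ValueError("max_retries must be at least 1")
    last_error = None
    for attempt in range(1, max_retries + 1):
        try:
            return func()
        except Exception as exc:  # broad catch, for illustration only
            last_error = exc
            if attempt < max_retries:
                sleep(retry_delay)  # no delay after the final attempt
    raise last_error
```

The `sleep` parameter exists only so the delay can be replaced in tests.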

Legacy Functions

make_markdown(source: str, output_dir: str, verbose: bool = False)

Legacy function for basic PDF to markdown conversion.

Parameters:

source (str): Path to input PDF or URL
output_dir (str): Output directory path
verbose (bool): Enable detailed logging
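Since `source` accepts either a local path or a URL, a caller may want to tell the two apart, for example to check that a local file exists before converting it. The following is a minimal sketch using only the standard library. `is_url` is a hypothetical helper, and it treats only `http` and `https` sources as URLs, which is an assumption rather than documented markdrop behavior.

```python
from urllib.parse import urlparse


def is_url(source: str) -> bool:
    """Return True when source looks like an http(s) URL rather than a local path."""
    parsed = urlparse(source)
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)
```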

extract_images(source: str, output_dir: str, verbose: bool = False)

Legacy function for basic image extraction.

Parameters:

source (str): Path to input PDF or URL
output_dir (str): Output directory path
verbose (bool): Enable detailed logging

extract_tables_from_pdf(pdf_path: str, **kwargs)

Legacy function for basic table extraction.

Parameters:

pdf_path (str): Path to input PDF or URL
start_page (int, optional): Starting page number
end_page (int, optional): Ending page number
threshold (float, optional): Detection confidence threshold
output_dir (str): Output directory path
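Because the optional settings are passed as keyword arguments, it can be convenient to assemble them in one place and leave out anything not set. The following is a sketch under stated assumptions: `build_table_kwargs` is a hypothetical helper, and the check that `threshold` lies between 0 and 1 is an assumption about the confidence scale, not something the reference above states.

```python
def build_table_kwargs(output_dir, start_page=None, end_page=None, threshold=None):
    """Collect keyword arguments for table extraction, skipping unset options."""
    if start_page is not None and end_page is not None and start_page > end_page:
        raise ValueError("start_page must not exceed end_page")
    # Assumption: the confidence threshold is a fraction in [0, 1].
    if threshold is not None and not 0.0 <= threshold <= 1.0:
        raise ValueError("threshold is assumed to lie between 0 and 1")
    optional = {"start_page": start_page, "end_page": end_page, "threshold": threshold}
    kwargs = {"output_dir": output_dir}
    kwargs.update({key: value for key, value in optional.items() if value is not None})
    return kwargs


# Intended use (requires markdrop to be installed):
# extract_tables_from_pdf("path/to/file.pdf", **build_table_kwargs("data/output", start_page=1, end_page=3))
```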

Contributing

We welcome contributions! Please see our Contributing Guidelines for details.

Development Setup

Clone the repository:

git clone https://github.com/shoryasethia/markdrop.git  
cd markdrop

Create a virtual environment:

python -m venv venv  
source venv/bin/activate  # On Windows: venv\Scripts\activate

Install development dependencies:

pip install -r requirements.txt

Project Structure

markdrop/
├── LICENSE
├── README.md
├── CONTRIBUTING.md
├── CHANGELOG.md
├── requirements.txt
├── setup.py
└── markdrop/
    ├── __init__.py
    ├── src
    │   └── markdrop-logo.png
    ├── main.py
    ├── process.py
    ├── api_setup.py
    ├── parse.py
    ├── utils.py
    ├── helper.py
    ├── ignore_warnings.py
    ├── run.py
    └── models/
        ├── __init__.py
        ├── .env
        ├── img_descriptions.py
        ├── logger.py
        ├── model_loader.py
        ├── responder.py
        └── setup_keys.py

License

This project is licensed under the MIT License - see the LICENSE file for details.

Changelog

See CHANGELOG.md for version history.

Code of Conduct

Please note that this project follows our Code of Conduct.

Support

For help or bug reports, open an issue on the project's GitHub repository.

Keywords

pdf markdown converter ai llm table-extraction image-analysis document-processing gemini openai
