pdf2epub

Convert PDF files to EPUB format via Markdown with intelligent layout detection

Version 0.1.2 on PyPI, 1 maintainer

PDF2EPUB 📚

Python 3.9+ | License: MIT

A powerful Python package for converting PDF files to EPUB format via Markdown with intelligent layout detection, AI-powered postprocessing, and seamless CLI/API integration.

✨ Features

  • 📖 Smart Layout Detection - Handles books, academic papers, and complex documents
  • 🔍 Advanced PDF Processing - OCR, table detection, and image extraction
  • 🤖 AI Postprocessing - Enhance quality with Anthropic Claude integration
  • 📝 Clean Markdown Output - Structured, readable markdown with preserved formatting
  • 📱 Professional EPUB - High-quality EPUB 3.0 output with customizable styling
  • 🌍 Multi-language Support - Process documents in multiple languages
  • 🚀 GPU Acceleration - NVIDIA CUDA and AMD ROCm support for faster processing
  • 🍎 Apple Silicon Support - Optimized performance on Apple Silicon devices
  • 🛠️ Flexible API - Use as CLI tool or import as Python library
  • 🔌 Plugin Architecture - Extensible AI provider system

🚀 Quick Start

Installation

# Basic installation
pip install pdf2epub

# Full installation with all features
pip install "pdf2epub[full]"  # quotes keep shells such as zsh from expanding the brackets

Command Line Usage

# Convert a PDF to EPUB
pdf2epub document.pdf

# Advanced options
pdf2epub book.pdf --start-page 10 --max-pages 50 --langs "English,German"
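When the CLI is driven from a script, the flags shown above can be assembled programmatically. The following is a minimal sketch that assumes only the three options appearing in this README (`--start-page`, `--max-pages`, `--langs`); `build_command` and `convert` are hypothetical helper names, not part of pdf2epub.

```python
import subprocess
from typing import List, Optional, Sequence


def build_command(
    pdf_path: str,
    start_page: Optional[int] = None,
    max_pages: Optional[int] = None,
    langs: Optional[Sequence[str]] = None,
) -> List[str]:
    """Assemble a pdf2epub argv list (hypothetical helper)."""
    cmd = ["pdf2epub", pdf_path]
    if start_page is not None:
        cmd += ["--start-page", str(start_page)]
    if max_pages is not None:
        cmd += ["--max-pages", str(max_pages)]
    if langs:
        # The README passes languages as one comma-separated string.
        cmd += ["--langs", ",".join(langs)]
    return cmd


def convert(pdf_path: str, **options) -> int:
    """Run the CLI and return its exit code (requires pdf2epub on PATH)."""
    return subprocess.run(build_command(pdf_path, **options)).returncode
```

Passing the arguments as a list avoids shell quoting issues with file names that contain spaces.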

Python API

import pdf2epub

# Simple conversion
pdf2epub.convert_pdf_to_markdown("document.pdf", "output/")
pdf2epub.convert_markdown_to_epub("output/", "final/")

# Advanced usage with AI enhancement
processor = pdf2epub.AIPostprocessor("output/")
processor.run_postprocessing("document.md", "anthropic")

  • For Apple Silicon, reinstall PyTorch with MPS support:
pip3 uninstall torch torchvision torchaudio
pip3 install --pre torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/nightly/cpu
  • Verify GPU support:
import torch
print(torch.__version__)  # PyTorch version
print(torch.cuda.is_available())  # Should return True for NVIDIA (also True on ROCm builds)
print(torch.backends.mps.is_available())  # Should return True for Apple Silicon
print(torch.version.hip)  # Should print ROCm version for AMD (None otherwise)

📦 Installation Options

Basic Installation

pip install pdf2epub

Includes core functionality with minimal dependencies.

Full Installation

pip install "pdf2epub[full]"

Includes all features: PDF processing, AI postprocessing, and GPU acceleration.

Development Installation

pip install "pdf2epub[dev]"

Includes development tools: testing, linting, and formatting.

GPU Support

NVIDIA CUDA:

pip install "pdf2epub[full]"
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121

AMD ROCm:

pip install "pdf2epub[full]"
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm6.2
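After installing one of the GPU builds above, it can help to confirm which backend PyTorch actually sees. The sketch below is an illustrative check, not a description of how pdf2epub selects its device; `pick_device` is a hypothetical helper name, and it falls back to CPU when PyTorch is not installed.

```python
def pick_device() -> str:
    """Return "cuda", "mps", or "cpu" depending on what PyTorch reports."""
    try:
        import torch
    except ImportError:
        # PyTorch is not installed (e.g. basic installation): fall back to CPU.
        return "cpu"

    # ROCm builds of PyTorch also report their GPUs through torch.cuda.
    if torch.cuda.is_available():
        return "cuda"
    # Apple Silicon GPUs are exposed through the MPS backend.
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"
    return "cpu"


print(pick_device())
```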

📚 Documentation

🎯 Use Cases

Academic Research

  • Convert research papers to readable EPUB format
  • Extract and preserve mathematical equations
  • Maintain citation formatting and structure

Digital Publishing

  • Transform print-ready PDFs into distribution-ready EPUBs
  • Preserve complex layouts and formatting
  • Optimize for e-reader compatibility

Document Archival

  • Convert legacy documents to modern formats
  • Batch process document collections
  • Enhance readability with AI postprocessing

Accessibility

  • Create screen-reader compatible versions
  • Improve text structure and navigation
  • Add semantic markup for better accessibility

🔧 Configuration

Environment Variables

# Required for AI postprocessing
export ANTHROPIC_API_KEY="your-anthropic-api-key"

# Optional: Control GPU usage
export CUDA_VISIBLE_DEVICES="0"  # Use specific GPU
export CUDA_VISIBLE_DEVICES=""   # Force CPU-only mode
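A script can check these variables before starting a long conversion. This is a minimal sketch using only the two variable names given above; `check_environment` is a hypothetical helper, and the wording of its notes is illustrative.

```python
import os
from typing import List, Mapping


def check_environment(env: Mapping[str, str] = os.environ) -> List[str]:
    """Return human-readable notes about the variables described above."""
    notes = []
    if not env.get("ANTHROPIC_API_KEY"):
        notes.append("ANTHROPIC_API_KEY is not set; AI postprocessing needs it.")
    devices = env.get("CUDA_VISIBLE_DEVICES")
    if devices is None:
        notes.append("CUDA_VISIBLE_DEVICES is unset; all GPUs are visible.")
    elif devices.strip() == "":
        notes.append("CUDA_VISIBLE_DEVICES is empty; CUDA GPUs are hidden (CPU-only).")
    else:
        notes.append(f"CUDA_VISIBLE_DEVICES restricts CUDA to: {devices}")
    return notes


for note in check_environment():
    print(note)
```

Accepting the environment as a mapping makes the check easy to exercise with a plain dictionary in tests.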

API Configuration

import pdf2epub

# Configure default settings
pdf2epub.config.set_default_batch_multiplier(3)
pdf2epub.config.set_default_ai_provider("anthropic")

🧪 Testing

Run the test suite:

pytest                    # Run all tests
pytest --cov=pdf2epub   # Run with coverage
pytest tests/test_pdf2md.py  # Run specific test file

Current test coverage: 49% with 100% pass rate (41/41 tests)

🔌 Plugin System

Create custom AI postprocessing providers:

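from pdf2epub.postprocessing.ai import AIPostprocessor

def process_with_custom_ai(system_prompt: str, request: str) -> str:
    # Placeholder: call your own AI service here and return its response
    raise NotImplementedError("connect your AI backend")

class CustomAIProvider:
    @staticmethod
    def getjsonparams(system_prompt: str, request: str) -> str:
        # Implement your AI API integration
        return process_with_custom_ai(system_prompt, request)

# Register and use your provider
work_dir = "output/"           # directory produced by the PDF-to-Markdown step
markdown_file = "document.md"  # Markdown file to postprocess
processor = AIPostprocessor(work_dir)
processor.register_provider("custom", CustomAIProvider)
processor.run_postprocessing(markdown_file, "custom")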
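For trying out the provider shape without an API key, an offline stand-in can be useful. The sketch below assumes only the `getjsonparams(system_prompt, request) -> str` signature shown above; the class name `EchoProvider` and the JSON keys it returns are illustrative and do not reflect a schema defined by pdf2epub.

```python
import json


class EchoProvider:
    """Offline stand-in provider; makes no network calls."""

    @staticmethod
    def getjsonparams(system_prompt: str, request: str) -> str:
        # Assumption: the provider returns a JSON string. The keys below are
        # illustrative only, not a schema defined by pdf2epub.
        return json.dumps(
            {
                "provider": "echo",
                "prompt_chars": len(system_prompt),
                "request_chars": len(request),
            }
        )


reply = json.loads(EchoProvider.getjsonparams("Fix OCR errors.", "Teh text"))
print(reply["provider"])
```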

📊 Performance

Benchmarks

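| Document Type  | Pages | Processing Time | Memory Usage |
|----------------|-------|-----------------|--------------|
| Research Paper | 20    | 45 seconds      | 2.1 GB       |
| Technical Book | 200   | 6 minutes       | 4.8 GB       |
| Magazine       | 50    | 2 minutes       | 1.9 GB       |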

Results on NVIDIA RTX 3080 with 16GB RAM
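To compare the rows on a common scale, the benchmark figures can be converted to seconds per page. The snippet below uses only the numbers from the table above; it is simple arithmetic, not an additional measurement.

```python
# (document type, pages, processing time in seconds) from the table above
benchmarks = [
    ("Research Paper", 20, 45),
    ("Technical Book", 200, 6 * 60),
    ("Magazine", 50, 2 * 60),
]

seconds_per_page = {name: secs / pages for name, pages, secs in benchmarks}

for name, value in seconds_per_page.items():
    print(f"{name}: {value:.2f} s/page")
```

On these figures the per-page cost falls between roughly 1.8 and 2.4 seconds across the three document types.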

Optimization Tips

  • Use GPU acceleration for 3-5x speed improvement
  • Adjust batch multiplier based on available memory
  • Process in chunks for very large documents
  • Enable AI postprocessing for best quality (slower)
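The chunking tip above can be sketched as a helper that produces page ranges for the `--start-page` and `--max-pages` options. `page_chunks` is a hypothetical name. Whether `--start-page` counts from 0 or 1 is not stated in this README, so the first page index is a parameter, and how the per-chunk outputs are combined afterwards is left open.

```python
from typing import Iterator, Tuple


def page_chunks(
    total_pages: int, chunk_size: int, first_page: int = 0
) -> Iterator[Tuple[int, int]]:
    """Yield (start_page, max_pages) pairs that cover the whole document."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    for offset in range(0, total_pages, chunk_size):
        # The final chunk may be shorter than chunk_size.
        yield first_page + offset, min(chunk_size, total_pages - offset)


for start, count in page_chunks(120, 50):
    print(f"pdf2epub book.pdf --start-page {start} --max-pages {count}")
```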

🆚 Comparison

| Feature          | PDF2EPUB | calibre | pandoc |
|------------------|----------|---------|--------|
| AI Enhancement   | ✅       | ❌      | ❌     |
| Layout Detection | ✅       | ⚠️      | ⚠️     |
| GPU Acceleration | ✅       | ❌      | ❌     |
| Python API       | ✅       | ⚠️      | ⚠️     |
| Plugin System    | ✅       | ✅      | ❌     |
| CLI Interface    | ✅       | ✅      | ✅     |

🚢 Deployment

Docker

FROM python:3.11-slim

RUN pip install pdf2epub[full]

WORKDIR /workspace
ENTRYPOINT ["pdf2epub"]

GitHub Actions

- name: Convert PDFs
  run: |
    pip install pdf2epub[full]
    pdf2epub documents/*.pdf

Production Deployment

import time
import pdf2epub
from pathlib import Path

def production_converter(pdf_path: str) -> dict:
    """Production-ready PDF conversion with error handling."""
    start_time = time.time()  # reference point for processing_time below
    try:
        output_dir = pdf2epub.convert_pdf_to_markdown(
            pdf_path, 
            batch_multiplier=2,  # Conservative memory usage
            max_pages=1000      # Prevent runaway processing
        )
        
        epub_path = pdf2epub.convert_to_epub(output_dir)
        
        return {
            "status": "success",
            "markdown_path": output_dir,
            "epub_path": epub_path,
            "processing_time": time.time() - start_time
        }
        
    except Exception as e:
        return {
            "status": "error", 
            "error": str(e)
        }
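The status dictionary returned above lends itself to batch processing. The following sketch groups input paths by outcome; `run_batch` and `fake_converter` are hypothetical names, and the stand-in converter exists only so the example runs without pdf2epub installed. In practice `production_converter` would be passed instead.

```python
from typing import Callable, Dict, Iterable, List


def run_batch(
    pdf_paths: Iterable[str],
    converter: Callable[[str], dict],
) -> Dict[str, List[str]]:
    """Run converter on every path and group the paths by reported status."""
    summary: Dict[str, List[str]] = {"success": [], "error": []}
    for path in pdf_paths:
        result = converter(path)
        # Any status other than "success" is treated as a failure.
        key = "success" if result.get("status") == "success" else "error"
        summary[key].append(path)
    return summary


def fake_converter(pdf_path: str) -> dict:
    """Stand-in for production_converter so the sketch runs without pdf2epub."""
    if pdf_path.endswith(".pdf"):
        return {"status": "success", "epub_path": pdf_path[:-4] + ".epub"}
    return {"status": "error", "error": "not a PDF"}


summary = run_batch(["a.pdf", "notes.txt", "b.pdf"], fake_converter)
print(summary)
```

Because the converter never raises, one failing document does not stop the rest of the batch.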

๐Ÿค Contributing

We welcome contributions! Please see our Contributing Guide for details.

Quick Contributing Steps

  • Fork the repository
  • Create a feature branch: git checkout -b feature-name
  • Make your changes and add tests
  • Test your changes: pytest
  • Format code: black .
  • Submit a pull request

See CONTRIBUTING.md for detailed guidelines.

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

๐Ÿ™ Acknowledgments

This project builds upon excellent open-source libraries:

📈 Project Status

  • Version: 0.1.2 (Beta)
  • Status: Active development
  • Python: 3.9+ supported
  • Testing: 49% coverage, 100% pass rate
  • CI/CD: GitHub Actions
  • Documentation: Comprehensive

📞 Support

Transform your PDFs into beautiful, accessible EPUBs with AI-powered enhancement! 🚀📚
