pdfcompare

A Python package to compare files (PDF, docx, images) and generate reports in txt, html, or PDF format

Version 0.2.0 on PyPI

Maintainers: 1

PDFCompare

PDFCompare is a Python package for comparing files of several types, including PDF, DOCX, and scanned images. It generates detailed difference reports that can be exported in TXT, HTML, or PDF format. The package uses PyMuPDF to parse PDFs, pytesseract for OCR on images, and python-docx to parse DOCX files. It also applies OpenCV-based image preprocessing to improve OCR accuracy.

Features

Compare multiple file types: PDF, DOCX, and scanned image files.
Export comparison reports: Generate and save reports in TXT, HTML, or PDF formats.
OCR for image files: Supports text extraction from scanned PDFs or images using pytesseract with advanced preprocessing.
Advanced image preprocessing: Leverage OpenCV for binarization, noise removal, and other image enhancements to improve OCR accuracy.
Easy-to-use CLI: Run comparisons via the command line or integrate into your own Python applications.
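To make the comparison and export features concrete, here is a minimal sketch of how two extracted texts could be turned into a TXT or HTML difference report. It uses Python's standard difflib module as a stand-in; the README does not say which diff algorithm pdfcompare uses internally, and the function names text_report and html_report are hypothetical.

```python
import difflib


def text_report(old: str, new: str, old_name: str = "file1", new_name: str = "file2") -> str:
    """Return a unified diff of two extracted texts as plain text."""
    diff = difflib.unified_diff(
        old.splitlines(),
        new.splitlines(),
        fromfile=old_name,
        tofile=new_name,
        lineterm="",
    )
    return "\n".join(diff)


def html_report(old: str, new: str, old_name: str = "file1", new_name: str = "file2") -> str:
    """Return a side-by-side HTML page highlighting the differences."""
    return difflib.HtmlDiff().make_file(
        old.splitlines(), new.splitlines(), fromdesc=old_name, todesc=new_name
    )


if __name__ == "__main__":
    a = "Invoice 42\nTotal: 100 EUR\n"
    b = "Invoice 42\nTotal: 120 EUR\n"
    print(text_report(a, b))
```

A PDF report can then be produced by converting the HTML output, which is the role wkhtmltopdf plays for this package.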

Installation

Python Requirements

Python 3.7+

External Dependencies

The following external dependencies are required for OCR, image preprocessing, and PDF report generation:

Tesseract OCR: For extracting text from images or scanned PDFs.
wkhtmltopdf: For converting HTML reports into PDFs.
OpenCV: For image preprocessing before OCR.
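As a quick sanity check, the short script below reports whether the two command-line tools are on the PATH. It is a sketch that assumes the executables carry their usual names, tesseract and wkhtmltopdf; the helper find_tools is hypothetical and not part of pdfcompare.

```python
import shutil

# Usual executable names for the two external command-line tools (assumption).
REQUIRED_TOOLS = ("tesseract", "wkhtmltopdf")


def find_tools(names=REQUIRED_TOOLS):
    """Map each executable name to its full path, or None if it is not on PATH."""
    return {name: shutil.which(name) for name in names}


if __name__ == "__main__":
    for name, path in find_tools().items():
        print(f"{name}: {path or 'NOT FOUND'}")
```

OpenCV is a Python package rather than a standalone executable, so it is checked differently, for example by trying to import cv2.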

Installing Tesseract

Linux (Debian/Ubuntu)

sudo apt-get update
sudo apt-get install tesseract-ocr

macOS

If you have Homebrew installed, run:

brew install tesseract

Windows

Download the Tesseract installer from the official repository and follow the installation instructions.

Installing wkhtmltopdf

Linux (Debian/Ubuntu)

sudo apt-get update
sudo apt-get install wkhtmltopdf

macOS

Using Homebrew:

brew install wkhtmltopdf

Windows

Download the wkhtmltopdf Windows installer and run it.

Installing OpenCV

To install OpenCV for image preprocessing, run:

pip install opencv-python

Installing the pdfcompare Package

Once all dependencies are installed, you can install pdfcompare via pip:

pip install pdfcompare

Usage

Command-Line Interface (CLI)

pdfcompare provides an intuitive command-line interface for comparing files and generating reports.

Basic Syntax

pdfcompare file1 file2 --output txt
pdfcompare file1 file2 --output html
pdfcompare file1 file2 --output pdf

Example

pdfcompare document1.pdf document2.docx --output html

This command compares document1.pdf and document2.docx, and saves the comparison result as an HTML report.

Options

file1, file2: Paths to the files you want to compare.
--output: Specify the format for the report (options: txt, html, pdf).
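The documented interface (two positional paths plus an --output choice) can be modelled with argparse as shown below. This is a hypothetical parser that mirrors the options listed above, not pdfcompare's actual implementation; in particular, making --output required is an assumption, since the README does not state a default format.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented CLI; not pdfcompare's own code."""
    parser = argparse.ArgumentParser(prog="pdfcompare")
    parser.add_argument("file1", help="path to the first file")
    parser.add_argument("file2", help="path to the second file")
    parser.add_argument(
        "--output",
        choices=["txt", "html", "pdf"],
        required=True,  # assumption: the README does not state a default
        help="report format",
    )
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args(
        ["document1.pdf", "document2.docx", "--output", "html"]
    )
    print(args.file1, args.file2, args.output)
```

Restricting --output with choices means an unsupported format is rejected at parse time with a usage message instead of failing later during report generation.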

Advanced Image Preprocessing for OCR

The pdfcompare package now supports advanced image preprocessing using OpenCV to improve OCR accuracy. This includes steps like binarization, noise removal, and other enhancements before performing text extraction.
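To illustrate what noise removal and binarization do to a grayscale image, here is a small pure-Python sketch using a 3x3 median filter and Otsu's threshold. It deliberately avoids OpenCV so it runs with the standard library alone; the README does not specify which OpenCV operations pdfcompare applies, so treat the choice of median filtering and Otsu thresholding as an assumption. With OpenCV, the comparable calls would be cv2.medianBlur and cv2.threshold with the cv2.THRESH_OTSU flag.

```python
from statistics import median_low


def median_filter(img):
    """Noise removal: replace each pixel with the median of its 3x3 neighbourhood."""
    h, w = len(img), len(img[0])
    out = [row[:] for row in img]
    for y in range(h):
        for x in range(w):
            window = [
                img[j][i]
                for j in range(max(0, y - 1), min(h, y + 2))
                for i in range(max(0, x - 1), min(w, x + 2))
            ]
            out[y][x] = median_low(window)
    return out


def otsu_threshold(img):
    """Pick the threshold that maximises between-class variance (Otsu's method)."""
    hist = [0] * 256
    for row in img:
        for p in row:
            hist[p] += 1
    total = sum(hist)
    sum_all = sum(i * n for i, n in enumerate(hist))
    best_t, best_var = 0, -1.0
    w_bg = sum_bg = 0
    for t in range(256):
        w_bg += hist[t]
        sum_bg += t * hist[t]
        w_fg = total - w_bg
        if w_bg == 0 or w_fg == 0:
            continue
        mean_bg = sum_bg / w_bg
        mean_fg = (sum_all - sum_bg) / w_fg
        var = w_bg * w_fg * (mean_bg - mean_fg) ** 2
        if var > best_var:
            best_t, best_var = t, var
    return best_t


def binarize(img):
    """Binarization: pixels above the Otsu threshold become 255, the rest 0."""
    t = otsu_threshold(img)
    return [[255 if p > t else 0 for p in row] for row in img]


if __name__ == "__main__":
    # A bright page with one dark speck of noise in the middle.
    noisy = [[200, 200, 200], [200, 0, 200], [200, 200, 200]]
    print(median_filter(noisy))
```

The images here are lists of rows holding integer pixel values from 0 to 255. Cleaning up specks before thresholding keeps isolated noise pixels from being turned into spurious black dots that the OCR engine might read as punctuation.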

Programmatic Usage

pdfcompare can be used as a Python module within your code.

# Example 1: compare two files and generate a report
from pdfcompare.cli import compare_files

file1 = "path/to/file1.pdf"
file2 = "path/to/file2.docx"
output_format = "pdf"  # Choose from 'txt', 'html', or 'pdf'

compare_files(file1, file2, output_format)

# Example 2: extract text from an image using OCR
from pdfcompare.file_handlers.image_handler import extract_text

text = extract_text("path/to/your/image.png")
print(text)

Testing

To run unit tests, first install the development dependencies, and then use:

python -m unittest discover tests/

Test Coverage:

Text extraction: From PDFs, DOCX files, and images.
File comparison logic: Verifies that differences between file contents are detected accurately and consistently.
Report generation: Tests for TXT, HTML, and PDF formats.
Image preprocessing: Tests the effectiveness of OpenCV preprocessing for OCR.
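For contributors adding tests, the sketch below shows the general shape of a test module that unittest discovery would pick up if saved under tests/ with a name matching test*.py. The helper changed_lines is hypothetical and defined inline so the example is self-contained; it does not correspond to a function in pdfcompare.

```python
import difflib
import unittest


def changed_lines(old: str, new: str):
    """Hypothetical helper: return removed and added lines, prefixed with '-' or '+'."""
    lines = list(difflib.unified_diff(old.splitlines(), new.splitlines(), lineterm=""))
    # Skip the two file header lines; hunk headers start with '@@' and are filtered out.
    return [line for line in lines[2:] if line.startswith(("+", "-"))]


class ChangedLinesTest(unittest.TestCase):
    def test_identical_texts_have_no_changes(self):
        self.assertEqual(changed_lines("a\nb", "a\nb"), [])

    def test_modified_line_is_reported(self):
        self.assertEqual(changed_lines("a\nb", "a\nc"), ["-b", "+c"])
```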

Contributing

Fork the repository.
Create your feature branch (git checkout -b feature/your-feature).
Commit your changes (git commit -am 'Add new feature').
Push to the branch (git push origin feature/your-feature).
Open a new Pull Request.

License

This project is licensed under the MIT License - see the LICENSE file for details.

Key Changes and Additions:

Advanced Image Preprocessing: Added details about preprocessing images using OpenCV before performing OCR to improve accuracy.
Python Version Requirement: Updated to require Python 3.7+.
Installation Section: Included OpenCV installation instructions.
Testing: Added specifics about testing image preprocessing with OpenCV and OCR.
Programmatic Usage: Clarified how to use the package as a Python module.

Changelog

Version 0.2.0

Added advanced image preprocessing (grayscale, binarization, and noise removal) using OpenCV to improve OCR accuracy.
Modularized the extract_text function for better maintainability.

Installation

To upgrade to the latest version:

pip install pdfcompare --upgrade
