
ocr-my-mess
A complete and modular Python pipeline to convert, OCR, and merge all your documents into a single, searchable PDF.
For the quickest start, pre-built executables for Windows, macOS, and Linux are available for download from the GitHub Releases page. These executables are standalone and do not require Python or any other dependencies to be installed on your system. Simply download the appropriate version for your operating system, extract it, and run.
[…] ocrmypdf to make them text-searchable.
[…] ocr-my-mess-cli) and a simple Graphical User Interface (ocr-my-mess-gui).
There are two main ways to install ocr-my-mess: from Conda or from PyPI.
This is the easiest and most reliable way to get started. The Conda environment, defined in the config/conda/environment.yml file, includes all Python dependencies as well as external binaries like Tesseract, Unpaper, and jbig2dec. This ensures you have the latest compiled versions, which are often more recent and performant than the ones provided by your operating system's package manager.
conda env create -f config/conda/environment.yml
conda activate ocr-my-mess
ocr-my-mess
For development: If you want to modify the source code, you can install the project in editable mode after activating the environment:
pip install -e .
Note on LibreOffice: The Conda environment does not include LibreOffice. If you need to convert office documents, you must install it separately on your system (see the PyPI installation section for instructions).
This method requires you to install system dependencies manually before installing the Python package.
Install System Dependencies
This project relies on several external programs. Please install them using your system's package manager.
Linux (Debian/Ubuntu):
sudo apt-get update
sudo apt-get install -y tesseract-ocr unpaper jbig2dec libreoffice
macOS:
brew install tesseract unpaper jbig2dec
brew install --cask libreoffice
Windows:
Installation on Windows is more complex. We recommend using the official ocrmypdf Docker image if possible. Otherwise, you will need to install the following dependencies manually:
choco install tesseract
choco install libreoffice
[…] ocrmypdf documentation for installation instructions.
Optional Dependencies:
jbig2enc: For better PDF compression. See the ocrmypdf documentation for installation.
Install ocr-my-mess from PyPI
pip install ocr-my-mess
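Because the PyPI route leaves the external programs to you, it can help to confirm they are on your PATH before the first run. The following is a minimal sketch, not part of ocr-my-mess. The executable names are assumptions: LibreOffice is usually exposed as soffice, and jbig2enc installs a binary named jbig2, but your platform may differ.

```python
import shutil

# Executable names are assumptions; adjust them for your platform.
REQUIRED_TOOLS = ["tesseract", "unpaper", "jbig2dec"]
# soffice (LibreOffice) is only needed for office documents;
# jbig2 (from jbig2enc) only improves PDF compression.
OPTIONAL_TOOLS = ["soffice", "jbig2"]


def find_missing(tools, which=shutil.which):
    """Return the names in `tools` that cannot be found on PATH."""
    return [name for name in tools if which(name) is None]


if __name__ == "__main__":
    for name in find_missing(REQUIRED_TOOLS):
        print(f"missing required tool: {name}")
    for name in find_missing(OPTIONAL_TOOLS):
        print(f"missing optional tool: {name}")
```

The `which` parameter exists only so the lookup can be replaced in tests; in normal use the default `shutil.which` is what you want.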
The application can be run in two modes:
The CLI provides several commands, including run, convert and merge.
# General help
ocr-my-mess --help
# Get version
ocr-my-mess -v
# Run the full pipeline on a directory
ocr-my-mess run --input /path/to/docs --output /path/to/final.pdf --lang en+fr
# Just convert and OCR all documents in a folder
ocr-my-mess convert --input-dir /path/to/docs --output-dir /path/to/output
# Just merge all PDFs in a folder into a single file
ocr-my-mess merge --input-dir /path/to/output --output-file /path/to/final.pdf
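If you want to drive the CLI from a Python script, one option is to call it through subprocess. This is a sketch under the assumption that the commands and flags are exactly those shown above. The helper names (build_run_command, build_two_step_commands, run_pipeline) are hypothetical and not part of the package.

```python
import subprocess


def build_run_command(input_dir, output_pdf, lang):
    """Build the argument list for the full pipeline (`ocr-my-mess run`)."""
    return [
        "ocr-my-mess", "run",
        "--input", str(input_dir),
        "--output", str(output_pdf),
        "--lang", lang,
    ]


def build_two_step_commands(input_dir, work_dir, output_pdf):
    """Build the `convert` and `merge` commands as two separate steps."""
    convert = [
        "ocr-my-mess", "convert",
        "--input-dir", str(input_dir),
        "--output-dir", str(work_dir),
    ]
    merge = [
        "ocr-my-mess", "merge",
        "--input-dir", str(work_dir),
        "--output-file", str(output_pdf),
    ]
    return [convert, merge]


def run_pipeline(input_dir, output_pdf, lang):
    """Run the full pipeline; raises CalledProcessError on a non-zero exit."""
    subprocess.run(build_run_command(input_dir, output_pdf, lang), check=True)
```

Passing the arguments as a list rather than a single shell string avoids quoting problems when paths contain spaces.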
For a more visual approach, you can launch the GUI by running the command without any arguments.
ocr-my-mess
This will open a window allowing you to:
To ensure everything is working correctly, run the automated tests:
pytest
This project uses PyInstaller to create a standalone executable. A build script is provided in the scripts/ directory.
# Build the executable
python scripts/build.py
The executable will be located in the dist/ directory.
Note: When running the GUI from the executable on Windows or macOS, a console window will appear alongside the main application window. This is expected behavior.
FAQs
A PDF pipeline to convert, OCR, and merge documents.
The PyPI package ocr-my-mess receives a total of 59 weekly downloads. As such, its popularity is classified as not popular.
We found that ocr-my-mess demonstrated a healthy version release cadence and project activity because the last version was released less than a year ago. It has 1 open source maintainer collaborating on the project.