
pdf-language-detector
A Python script that iterates over the PDF files in a directory and tries to guess their language with Tesseract OCR.
PLD is a Python program that analyzes PDF files, extracts images, processes them using Optical Character Recognition (OCR), and detects the dominant language of the text. It provides language detection information in JSON format and calculates the average confidence coefficient for each language.
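The averaging step described above can be sketched as follows. This is a minimal illustration, not PLD's actual implementation: the `(language, confidence)` input pairs and the function names `average_confidence` and `dominant_language` are hypothetical, and the sketch assumes the dominant language is the one with the highest average confidence.

```python
from collections import defaultdict


def average_confidence(page_results):
    """Average the OCR confidence per language.

    page_results: iterable of (language, confidence) pairs, one per
    page and language tried (hypothetical input shape).
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for language, confidence in page_results:
        totals[language] += confidence
        counts[language] += 1
    return {language: totals[language] / counts[language] for language in totals}


def dominant_language(averages):
    """Return the language with the highest average, or None if empty."""
    if not averages:
        return None
    return max(averages, key=averages.get)


if __name__ == "__main__":
    # Made-up confidence values, for illustration only.
    results = [("eng", 90.0), ("spa", 40.0), ("eng", 80.0), ("spa", 60.0)]
    averages = average_confidence(results)
    print(averages)
    print(dominant_language(averages))
```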
Install Tesseract OCR and pdftoppm using your package manager. For example, on Ubuntu:
sudo apt install tesseract-ocr tesseract-ocr-all poppler-utils
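To confirm that both executables are reachable before running PLD, a quick check along these lines may help. It is a sketch that is not part of PLD: the helper name `missing_tools` is hypothetical, and it only verifies that `tesseract` and `pdftoppm` are on the `PATH`, not that the Tesseract language packs are installed.

```python
import shutil


def missing_tools(tools=("tesseract", "pdftoppm")):
    """Return the names of the tools that are not found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing executables:", ", ".join(missing))
    else:
        print("Tesseract and pdftoppm are available.")
```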
Install with pip:
python3 -m pip install --user pdf-language-detector
Then run directly from your terminal:
pld --help
Clone the PLD repository:
git clone git@github.com:icij/pld.git
Install the required Python packages with poetry:
poetry install
Then run inside a virtual env managed by poetry:
poetry run pld --help
Install with Docker:
docker pull icij/pld
Then run inside a container:
docker run -it icij/pld pld --help
This command processes PDF files and detects their dominant language.
pld detect --help
--language: Three-letter ISO language code to detect. Repeat the option once per language.
--input-dir: Path to the input directory containing PDF files. Default is the current directory.
--output-dir (optional): Path to the output directory. Default is 'out' directory in the current directory.
--max-pages (optional): Maximum number of pages to process per PDF file. Default is 5.
--resume (optional): Skip PDF files already analyzed.
--skip-images (optional): Skip the extraction of PDF files as images.
--skip-ocr (optional): Skip the OCR of images from PDF files.
--parallel (optional): Number of threads to run in parallel.
--relative-to (optional): Path to the directory relative to which the output directory path is built.
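When PLD is driven from another Python script, the options above can be assembled into a command line programmatically. The sketch below is an illustration under stated assumptions: the helper `build_detect_command` is hypothetical and not part of PLD, it covers only a subset of the options, and it assumes `--resume` is a plain on/off flag. Check `pld detect --help` for the exact syntax before relying on it.

```python
import subprocess


def build_detect_command(languages, input_dir=".", output_dir="out",
                         max_pages=None, resume=False, parallel=None):
    """Build the argument list for `pld detect` (hypothetical helper)."""
    command = ["pld", "detect"]
    for language in languages:
        # The --language option is repeated once per language code.
        command += ["--language", language]
    command += ["--input-dir", input_dir, "--output-dir", output_dir]
    if max_pages is not None:
        command += ["--max-pages", str(max_pages)]
    if resume:
        # Assumed to be a boolean flag that takes no value.
        command.append("--resume")
    if parallel is not None:
        command += ["--parallel", str(parallel)]
    return command


if __name__ == "__main__":
    command = build_detect_command(["eng", "spa"], input_dir="documents",
                                   output_dir="results", max_pages=3,
                                   resume=True)
    print(" ".join(command))
    # Uncomment to actually run PLD (requires it to be installed):
    # subprocess.run(command, check=True)
```

Passing the arguments as a list rather than a single string avoids shell quoting problems with directory names that contain spaces.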
This command prints a report of the previously detected languages (using the same output directory).
pld report --help
--output-dir: Path to the output directory. Default is 'out' directory in the current directory.
You can run the test suite (powered by pytest) with this command:
make test
Process PDF files in the 'documents' directory, detect English and Spanish, and save the results in the 'results' directory:
pld detect --language eng --language spa --input-dir documents --output-dir results
Process PDF files in the 'documents' directory, detect French and Greek languages, and limit the processing to 3 pages per file:
pld detect --language fra --language ell --input-dir documents --max-pages 3
This project is licensed under the MIT License.