
A Python OCR package for extracting text from scanned and native images or PDFs, designed for integration into AI applications such as LLMs and RAG, and to produce clean plain text (.txt) and Markdown (.md) files.
Important: This package requires the Tesseract OCR engine to be installed on your system. rostaing-ocr is a Python wrapper that calls the tesseract command-line tool. You must install it and its language packs first.
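Install the language packs you need (for example, fra for French; English is usually included by default).
With Homebrew (typically macOS):
brew install tesseract
You can add language packs by installing tesseract-lang.
With apt (Debian/Ubuntu):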
sudo apt update
sudo apt install tesseract-ocr
# Also install the language packs you need. For French:
sudo apt install tesseract-ocr-fra
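To confirm the prerequisite before installing the package, a small check along these lines may help. It is a minimal sketch, not part of rostaing-ocr: the helper name `tesseract_languages` is hypothetical, and it assumes the `tesseract` binary is on your PATH and that the first line printed by `tesseract --list-langs` is a header.

```python
import shutil
import subprocess
from typing import List, Optional


def tesseract_languages(binary: str = "tesseract") -> Optional[List[str]]:
    """Return installed Tesseract language codes, or None if the binary is not on PATH.

    Hypothetical helper for illustration; it is not provided by rostaing-ocr.
    """
    path = shutil.which(binary)
    if path is None:
        return None
    result = subprocess.run(
        [path, "--list-langs"],
        capture_output=True,
        text=True,
        check=False,
    )
    # Some Tesseract versions write the list to stderr instead of stdout.
    lines = (result.stdout or result.stderr).splitlines()
    # Assumption: the first line is a header, the remaining lines are language codes.
    return [line.strip() for line in lines[1:] if line.strip()]


if __name__ == "__main__":
    langs = tesseract_languages()
    if langs is None:
        print("tesseract not found on PATH; install it first.")
    else:
        print("Installed languages:", ", ".join(langs))
```

If a code you need, such as fra, is missing from the list, install the matching language pack and run the check again.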
To keep project dependencies isolated and avoid conflicts with other Python projects on your system, it is highly recommended to use a virtual environment.
With a standard Python installation, you can create and activate a new environment using the following commands:
On macOS/Linux:
# Create an environment named '.venv' in your project directory
python3 -m venv .venv
# Activate the environment
source .venv/bin/activate
On Windows:
# Create an environment named '.venv' in your project directory
python -m venv .venv
# Activate the environment
.venv\Scripts\activate
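If you are unsure whether the environment is actually active, one common check is to compare `sys.prefix` with `sys.base_prefix`. This is a sketch for environments created with `python -m venv`; the function name `in_virtualenv` is illustrative.

```python
import sys


def in_virtualenv() -> bool:
    """Return True when the interpreter runs inside a venv-created environment."""
    # Inside a venv, sys.prefix points at the environment directory,
    # while sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix


print("Virtual environment active:", in_virtualenv())
```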
Once Tesseract is set up and your virtual environment is activated, you can install the package from PyPI:
pip install rostaing-ocr
Here is a basic example of how to use the RostaingOCR class.
from rostaing_ocr import RostaingOCR

# --- Example 1: Process a single file ---
# This will create 'my_result.txt' and 'my_result.md' in the current directory.
extractor = RostaingOCR(
    input_path_or_paths="path/to/my_document.pdf",
    output_basename="my_result",  # Optional
    print_to_console=True,  # Optional
)
# Print the object to get a summary of the operation.
print(extractor)

# --- Example 2: Process multiple files and print to console ---
# This will process both files, save a consolidated output, and also print the results.
multi_extractor = RostaingOCR(
    input_path_or_paths=["document1.png", "scan_page_2.pdf"],
    output_basename="combined_report",  # Optional
    print_to_console=True,  # Optional
    languages=["fra", "eng"],  # Optional: languages passed to Tesseract
)
# Print the object to get a summary of the operation.
print(multi_extractor)
Large Language Models (LLMs) like GPT-4 or Llama understand text, not images or scanned documents. A vast amount of valuable knowledge is locked away in unstructured formats such as PDFs of research papers, scanned invoices, or legal contracts.
Rostaing OCR serves as a first step in a data ingestion pipeline for Retrieval-Augmented Generation (RAG) systems. It bridges the gap by converting this inaccessible visual data into clean, structured text that LLMs can process.
By using Rostaing OCR, you can automate the process of building a knowledge base from your documents, going from scanned PDFs or images to Markdown/text.
In short, Rostaing OCR unlocks your documents, making them ready for any modern AI stack.
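As an illustration of the step that typically follows OCR in a RAG pipeline, the extracted text can be split into overlapping chunks before embedding or indexing. The sketch below is an assumption about downstream usage, not a feature of rostaing-ocr: `chunk_text`, `max_chars`, and `overlap` are hypothetical names, and the file name `my_result.txt` is taken from Example 1 above.

```python
from pathlib import Path
from typing import List


def chunk_text(text: str, max_chars: int = 500, overlap: int = 50) -> List[str]:
    """Split text into fixed-size character chunks that overlap by `overlap` characters.

    Hypothetical helper for illustration; it is not part of rostaing-ocr.
    """
    if max_chars <= 0:
        raise ValueError("max_chars must be positive")
    if not 0 <= overlap < max_chars:
        raise ValueError("overlap must satisfy 0 <= overlap < max_chars")

    chunks: List[str] = []
    step = max_chars - overlap
    for start in range(0, len(text), step):
        chunk = text[start:start + max_chars]
        if chunk.strip():  # skip chunks that contain only whitespace
            chunks.append(chunk)
        if start + max_chars >= len(text):
            break  # the current chunk already reaches the end of the text
    return chunks


if __name__ == "__main__":
    # 'my_result.txt' is the output name used in Example 1; adjust as needed.
    ocr_output = Path("my_result.txt")
    if ocr_output.exists():
        pieces = chunk_text(ocr_output.read_text(encoding="utf-8"))
        print(f"{len(pieces)} chunks ready for embedding")
    else:
        print("No OCR output found; run the extraction first.")
```

Character-based chunking is the simplest option; splitting on Markdown headings or paragraphs in the .md output may preserve more context, depending on your documents.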
This project is licensed under the MIT License. See the LICENSE file for more details.