
Herbarium-OCR is an open-source OCR tool primarily designed for extracting text from floras, papers, and handwritten or printed labels of herbarium specimens from Central Eurasian countries, aiming to support research in plant systematics and ecology. It can also process scanned documents and photos from other regions and languages. Users are advised to first consider commercial OCR solutions for stable service and support, such as ABBYY, Google Document AI, and TextIn.
The workflow includes:
Supported OCR Engines:
Gemini (gemini-2.0-flash), Qwen (qwen-vl-plus), ChatGLM (glm-4v-plus). Theoretically compatible with local Ollama or vLLM.
Local OCR (surya-ocr): A Torch-based OCR engine (GitHub). The current integration requires approximately 7GB of VRAM per process for layout+OCR. Please check your CUDA device specifications before use. (Note: Local OCR integration is not extensively tested.)
Image Preprocessing Features (Configurable):
herbarium-ocr-preprocess (or python -m Main.image_processer when running from source). This tool can apply the full preprocessing pipeline (including optional auto-rotation and all enhancement steps; behavior depends on the configuration file) to a single file or an entire directory of files. Note: Preprocessing many or high-resolution full images can be time-consuming. The processed files are saved under the input path for subsequent OCR with the main script.
Output Formats: Supports Markdown, JSON, XML, and HTML. By default, only a full.json file containing all details is generated. Other formats can be requested via the --output_format argument.
Batch Processing: pdf_batch and image_batch modes support parallel processing using multiple processes. The number of worker processes is configurable (default is 1).
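To illustrate what multi-process batch handling generally looks like, here is a minimal Python sketch using the standard library. It is not the project's implementation: ocr_one, resolve_workers, and run_batch are hypothetical names, and treating 0 as "all CPU cores" is an assumption based on the max_workers comment in the example configuration.

```python
from concurrent.futures import ProcessPoolExecutor
import os


def ocr_one(path: str) -> str:
    """Placeholder for per-file work; a real worker would run layout analysis and OCR."""
    return f"processed:{os.path.basename(path)}"


def resolve_workers(max_workers: int) -> int:
    """Assumption: 0 means 'use all CPU cores'; any positive value is used as given."""
    return max_workers if max_workers > 0 else (os.cpu_count() or 1)


def run_batch(paths: list[str], max_workers: int = 1) -> list[str]:
    """Process files in separate worker processes; results keep the input order."""
    with ProcessPoolExecutor(max_workers=resolve_workers(max_workers)) as pool:
        return list(pool.map(ocr_one, paths))


if __name__ == "__main__":
    print(run_batch(["a.pdf", "b.pdf"], max_workers=2))
```

Since each local OCR process may need several gigabytes of VRAM, the worker count is usually bounded by GPU memory rather than by CPU cores.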
This project was developed by the author during graduate studies. Due to time constraints and research commitments, future maintenance and feature development will primarily rely on community collaboration. Users are encouraged to:
requirements.txt):
toml: For parsing configuration files
pytesseract: For auto-rotation feature
surya-ocr: For local OCR (CUDA device with >8GB VRAM strongly recommended)
requests: For XFYun OCR services
Install Herbarium-OCR via PyPI:
pip install herbarium-ocr
To install with all optional features (Tesseract support, XFYun client, Surya client):
pip install "herbarium-ocr[full]"
Note: If enabling auto-rotation, you must separately install the Tesseract OCR engine for your operating system.
GPU Support: For accelerated processing, install a CUDA-enabled version of PyTorch from the PyTorch website.
Clone the repository if you need to contribute or use the latest development version:
From Gitee:
git clone https://gitee.com/esenzhou/Herbarium-OCR-Public.git
cd Herbarium-OCR-Public
From GitHub:
git clone https://github.com/GrootOtter/Herbarium-OCR-Public.git
cd Herbarium-OCR-Public
Install dependencies (ideally in a virtual environment):
pip install -r requirements.txt
Install optional dependencies (e.g., to enable auto-rotation):
pip install pytesseract
# Also install the Tesseract OCR engine itself (see below)
Install Tesseract OCR Engine (only if enabling auto-rotation), e.g. on Debian/Ubuntu:
sudo apt install tesseract-ocr
GPU Support: Install a CUDA-enabled version of PyTorch from the PyTorch website.
Use the following command-line tools after installation:
herbarium-ocr
Process PDF or image files for OCR.
herbarium-ocr --mode <mode> --input <input_path> --model <model_name> [options]
--mode: pdf, pdf_batch, image, image_batch
--languages: Comma-separated language codes (e.g., hy,ru)
--output_format: markdown, json, xml, html (generates this in addition to full.json)
--preprocess_images: Enable image block enhancements
-v, --verbose: Enable debug logging
-c, --config: Path to custom TOML config file
Example:
herbarium-ocr --mode pdf --input document.pdf --model gemini --output_format html
herbarium-ocr-convert
Convert an existing full.json output file to other formats.
herbarium-ocr-convert <input_path_full.json> --to <format>... [-v]
--to: markdown, md, html, htm, xml, json (filtered version)
Example:
herbarium-ocr-convert output_full.json --to markdown html
herbarium-ocr-preprocess
Test the preprocessing pipeline (rotation attempt, enhancements).
herbarium-ocr-preprocess --input <input_path> [-c <config_path>] [-v]
Example:
herbarium-ocr-preprocess --input image.jpg
herbarium-ocr-check-layout
Display supported layout classes from the built-in model.
herbarium-ocr-check-layout [-c <config_path>] [-v]
Example:
herbarium-ocr-check-layout -c my_config.toml -v
If you cloned the repository, run scripts from the project root using python -m:
python -m Main.herbarium_ocr --mode <mode> --input <input_path> --model <model_name> [options]
Example:
python -m Main.herbarium_ocr --mode pdf --input document.pdf --model gemini --output_format html
python -m Main.convert <input_path_full.json> --to <format>... [-v]
Example:
python -m Main.convert output_full.json --to markdown
python -m Main.image_processer --input <input_path> [-c <config_path>] [-v]
Example:
python -m Main.image_processer --input image.jpg
python -m Main.check_layout_model [-c <config_path>] [-v]
Example:
python -m Main.check_layout_model -c my_config.toml -v
Customize via herbarium_ocr_config.toml. Search order: -c path > User dir > Defaults. Only include settings you want to override.
Example Config (herbarium_ocr_config.toml):
[OCR_CONFIG]
languages = "en,ru" # Default language hints
output_format = "html" # Default conversion format
preprocess_images = true # Enable block enhancement
enhance_contrast = true
denoise = false # Disable slow denoising
sharpen = true
attempt_auto_rotation = true # Enable Tesseract rotation
# tesseract_cmd_path = "/usr/local/bin/tesseract" # Tesseract path (Example)
min_rotation_confidence = 50
max_workers = 0 # Use all CPU cores for batch
[DOCLAYOUT_CONFIG]
RELEVANT_TEXT_CLASSES = ["title", "plain text"]
DOCLAYOUT_CONF_THRESHOLD = 0.25
[MODEL_CONFIGS]
# Add a new model definition (Example using OpenRouter)
[MODEL_CONFIGS.openrouter] # Name used with the --model argument (e.g., --model openrouter)
type = "openai_compatible" # Specifies which client handles this (OpenAI compatible)
language_mode = "list_hint" # How the client uses the --languages arg (accepts list as hint)
api_key_env = "OPENROUTER_API_KEY" # Environment variable name holding the API key
base_url = "https://openrouter.ai/api/v1" # Base URL for the API endpoint (provider: OpenRouter)
model_id = "google/gemma-3-27b-it:free" # Specific model identifier (get from provider's documentation)
rpm_limit = 20 # Requests Per Minute limit (check provider's documentation/limits)
# Add local Ollama model (Example, untested)
[MODEL_CONFIGS.ollama_llava]
type = "openai_compatible"
language_mode = "list_hint"
api_key_env = "OLLAMA_API_KEY" # Can be dummy value like "ollama"
base_url = "http://localhost:11434/v1"
model_id = "gemma3:27b" # Your loaded model name
rpm_limit = 10
max_dimension = 0 # Disable client image processing
# Modify existing gemini config
[MODEL_CONFIGS.gemini]
model_id = "gemini-2.0-flash-lite"
rpm_limit = 30
# Modify XFyun printed OCR params
[MODEL_CONFIGS.xfyun-printed-ocr]
param_value = "ru" # Default language Russian
max_dimension = 2000
jpeg_quality = 90
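The rpm_limit values above cap how many requests per minute are sent to a provider. How the project enforces the limit is not described here; one common approach is to space calls at a fixed interval, as in this minimal Python sketch. RpmLimiter is a hypothetical class, not part of herbarium-ocr.

```python
import time


class RpmLimiter:
    """Space calls at least 60/rpm seconds apart (a simple fixed-interval limiter)."""

    def __init__(self, rpm: int, clock=time.monotonic, sleep=time.sleep):
        self.interval = 60.0 / rpm
        self._clock = clock          # injectable so the limiter can be tested
        self._sleep = sleep
        self._next_allowed = 0.0     # earliest time the next call may start

    def wait(self) -> float:
        """Block until the next call is allowed; return the seconds slept."""
        now = self._clock()
        delay = max(0.0, self._next_allowed - now)
        if delay:
            self._sleep(delay)
        # Reserve the next slot one interval after this call's start time.
        self._next_allowed = max(now, self._next_allowed) + self.interval
        return delay


if __name__ == "__main__":
    limiter = RpmLimiter(rpm=20)  # 20 requests per minute: one call every 3 seconds
    for page in range(3):
        limiter.wait()
        print("sending request for page", page)
```

A fixed interval is stricter than a true per-minute window, since it never allows bursts, but it is simple and stays under the provider's limit.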
Note: Run herbarium-ocr-check-layout or python -m Main.check_layout_model to see supported RELEVANT_TEXT_CLASSES.
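Because a user file only needs the settings it overrides, the loaded values are presumably layered over the built-in defaults. The Python sketch below shows one way such a layered merge can work. It is an illustration under assumptions: merge_config and the default values are hypothetical, and the dictionaries stand in for what a TOML parser would return.

```python
from copy import deepcopy


def merge_config(defaults: dict, overrides: dict) -> dict:
    """Return defaults with overrides applied; nested tables are merged key by key."""
    merged = deepcopy(defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_config(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


# Hypothetical built-in defaults (illustrative values, not the project's).
defaults = {
    "OCR_CONFIG": {"languages": "en", "denoise": True, "max_workers": 1},
    "MODEL_CONFIGS": {"gemini": {"model_id": "gemini-2.0-flash", "rpm_limit": 15}},
}
# What a small user file might contain after parsing.
user = {
    "OCR_CONFIG": {"languages": "en,ru"},
    "MODEL_CONFIGS": {"gemini": {"rpm_limit": 30}},
}

config = merge_config(defaults, user)
print(config["OCR_CONFIG"])
print(config["MODEL_CONFIGS"]["gemini"])
```

Merging table by table means that overriding rpm_limit for one model leaves its other keys, such as model_id, at their defaults.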
API Key/Credential Setup (Environment Variables):
OpenAI-Compatible Models:
Obtain the corresponding API keys from the respective LLM provider's website. Before using a model with this project, you can send it an image containing text to check that it supports image input. This project accesses the following models (--model parameter):
Set environment variables:
Temporary Setup (Linux/macOS shell, current session only):
export GOOGLE_API_KEY="your-google-api-key" # For Gemini
export XAI_API_KEY="your-xai-api-key" # For Grok
export DASHSCOPE_API_KEY="your-dashscope-api-key" # For Qwen
export ZHIPUAI_API_KEY="your-zhipuai-api-key" # For GLM-4
export YI_API_KEY="your-yi-api-key" # For Yi
...
Permanent Setup:
Add the above export commands to your shell configuration file (e.g., ~/.bashrc, ~/.zshrc):
echo 'export GOOGLE_API_KEY="your-google-api-key"' >> ~/.bashrc
Reload the shell configuration:
source ~/.bashrc # or source ~/.zshrc
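After setting the variables, a quick check from Python can confirm they are visible to new processes. This is a minimal sketch and not part of the project's tooling: missing_env_vars is a hypothetical helper, and the list of names should match the providers you actually use.

```python
import os


def missing_env_vars(names, env=None):
    """Return the names that are unset or blank in the given environment mapping."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name, "").strip()]


if __name__ == "__main__":
    needed = ["GOOGLE_API_KEY"]  # adjust to the providers you actually use
    missing = missing_env_vars(needed)
    print("Missing:", ", ".join(missing) if missing else "none")
```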
Temporary Setup (Windows, current session only):
Open PowerShell and run:
$env:GOOGLE_API_KEY = "your-google-api-key"
Permanent Setup:
[System.Environment]::SetEnvironmentVariable("GOOGLE_API_KEY", "your-google-api-key", "User")
Alternatively, set environment variables via the GUI:
GOOGLE_API_KEY) and value (e.g., your-google-api-key).
XFYun OCR API (--model parameter: xfyun-general-ocr, xfyun-printed-ocr):
SPARK_APPID, SPARK_API_KEY, SPARK_API_SECRET.
RELEVANT_TEXT_CLASSES and DOCLAYOUT_CONF_THRESHOLD in config. Enabling herbarium-ocr-preprocess might improve detection confidence.
-v to verify environment variables are correctly set and checked.
pytesseract library are installed and configured correctly (PATH or tesseract_cmd_path).
Contributions are welcome! Please use:
).Contributions are welcome! Please use:
This project is licensed under the GNU Affero General Public License v3.0 (AGPL-3.0). See the LICENSE file. The included doclayout_yolo_docstructbench_imgsz1024.pt model file is also under AGPL-3.0.
Thanks to the developers of key open-source projects and libraries such as DocLayout-YOLO, PyMuPDF, Pillow, OpenAI Python SDK, Requests, Tesseract OCR, and PyTorch. Special thanks to Gemini and Grok for their code instructions. Also, thanks to the Herbarium of Xinjiang Institute of Ecology and Geography, CAS (XJBI) for supporting this work.