
A powerful Python toolkit for generating synthetic datasets for Optical Character Recognition (OCR) model training and evaluation. This toolkit enables generating realistic text images with configurable backgrounds, fonts, augmentations, and ground-truth labels, supporting research and production needs for OCR systems.
The OCR Data Toolkit (ocr_data_toolkit) is designed to generate high-quality synthetic data for OCR applications.
Synthetic data is essential for training robust OCR models, especially when annotated real-world data is scarce or expensive to collect. This toolkit helps you simulate diverse real-world scenarios, improving model generalization.
Requirements: Pillow, numpy, opencv-python, matplotlib, atpbar

Install from PyPI:

pip install ocr-data-toolkit
Or clone and install locally:
git clone https://github.com/NaumanHSA/ocr-data-toolkit.git
cd ocr-data-toolkit
pip install .
from ocr_data_toolkit import ODT
# Simple example: generate a single image
odt = ODT(language="en")
text, image = odt.generate_single_image()
image.save("sample.png")
print("Ground truth:", text)
Configuration is managed via the Config dataclass (see ocr_data_toolkit/common/config.py).
Parameter Reference:
Parameter | Type | Description |
---|---|---|
language | str | Language code (e.g., en). If not specified, defaults to English. |
bag_of_words | List[str] | List of words for random text. If not provided, loads built-in vocabulary for the selected language. |
backgrounds_path | str | Path to background images. If not set, uses built-in backgrounds shipped with the package. |
fonts_path | str | Path to font files. If not set, uses built-in fonts for the selected language. |
text_probs | Dict[str, float] | Probabilities for generating text, date, or number. Defaults to {text: 0.7, date: 0.1, number: 0.2}. |
output_image_size | Tuple[int, int] | Output image size (width, height). If None, uses the natural size of generated images. |
train_test_ratio | float | Ratio for splitting train/test sets. Default is 0.2 (20% test). |
output_save_path | str | Where to save generated data. Defaults to an export folder in the project root. |
augmentation_config | Dict | Augmentation settings (see below). If not provided, uses sensible defaults. |
num_workers | int | Number of parallel workers for data generation. Default is 4. |
Defaults and Auto-loading:
- If you do not set backgrounds_path or fonts_path, the toolkit automatically uses built-in backgrounds and fonts for the selected language (see the data/ directory).
- If you do not provide bag_of_words, the toolkit loads a built-in vocabulary file for the selected language.

Example:
from ocr_data_toolkit import ODT
config = {
"language": "en",
"output_image_size": (128, 32),
"augmentation_config": {"max_num_words": 5, "num_lines": 2},
}
odt = ODT(**config)
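
The same pattern covers the rest of the parameter reference. A fuller, purely illustrative configuration might look like the following; every path and value here is an example rather than a default:

from ocr_data_toolkit import ODT

# All values below are illustrative; omit any key to fall back to the defaults
# listed in the parameter reference above.
config = {
    "language": "en",
    "backgrounds_path": "./my_backgrounds",  # built-ins are used if omitted
    "fonts_path": "./my_fonts",              # built-ins are used if omitted
    "text_probs": {"text": 0.6, "date": 0.2, "number": 0.2},
    "output_image_size": (256, 64),
    "train_test_ratio": 0.1,                 # 10% of samples go to the test set
    "output_save_path": "./export",
    "num_workers": 8,
}
odt = ODT(**config)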
Augmentations are controlled by the AugmentationConfig class (see ocr_data_toolkit/common/config.py). You can override any default by passing a dictionary to augmentation_config.
Detailed Augmentation Parameters:
Parameter | Type | Default | Description |
---|---|---|---|
max_num_words | int | 5 | Maximum number of words per generated text sample. |
num_lines | int | 1 | Number of lines per image. |
font_size | int | 36 | Font size for rendered text. |
text_colors | List[str] | ["#2f2f2f", "black", "#404040"] | List of possible text colors (hex or names). |
letter_spacing_prob | float | 0.4 | Probability of applying random letter spacing. |
margin_x, margin_y | Tuple[float, float] | (0.5, 1.5) | Padding factors for horizontal/vertical margins, as a multiple of character width/height. |
blur_probs | Dict[str, float] | {gaussian: 0.3, custom_blurs: 0.6} | Probabilities for applying Gaussian blur and custom blurs (motion, bokeh). |
moire_prob | float | 0.3 | Probability of overlaying a moire pattern. |
opacity_prob | float | 0.3 | Probability of applying random opacity to the image. |
opacity_range | Tuple[int, int] | (150, 210) | Range of alpha values for random opacity. |
brightness_range | Tuple[float, float] | (0.7, 1.2) | Range for random brightness adjustment. |
perspective_transform_prob | float | 0.3 | Probability of applying a random perspective transformation. |
random_crop_width_range | Tuple[float, float] | (0.008, 0.01) | Range for random cropping width as a fraction of image width. |
random_crop_height_range | Tuple[float, float] | (0.01, 0.1) | Range for random cropping height as a fraction of image height. |
random_resize_factor_range | Tuple[float, float] | (0.9, 1.0) | Range for random resizing factor. |
random_stretch_factor_range | Tuple[float, float] | (0.1, 0.3) | Range for random vertical stretch/compression. |
Parameter Explanations:
- margin_x=(0.5, 1.5) means horizontal margins are randomly chosen between 0.5x and 1.5x the character width.

Example:
augmentation = {
"max_num_words": 8,
"num_lines": 2,
"font_size": 40,
"blur_probs": {"gaussian": 0.5, "custom_blurs": 0.7},
"moire_prob": 0.4,
}
odt = ODT(augmentation_config=augmentation)
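
For intuition, the probability parameters act as per-image coin flips. The sketch below shows the general pattern for a probability-gated Gaussian blur; it illustrates the idea rather than the toolkit's actual internals:

import random
import cv2
import numpy as np

def maybe_gaussian_blur(image: np.ndarray, prob: float = 0.3) -> np.ndarray:
    # With probability `prob`, blur the image; otherwise return it unchanged.
    if random.random() < prob:
        k = random.choice([3, 5])  # kernel size must be odd
        return cv2.GaussianBlur(image, (k, k), 0)
    return image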
All generated images are created with the specified output_image_size (width, height). If this parameter is set to None, the image keeps its natural size based on the rendered text and font. When resizing is needed, the toolkit uses a resize-and-pad strategy, which ensures that all images in your dataset have a consistent size, suitable for training deep learning models.
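
As a minimal sketch of the strategy (an illustration, not the toolkit's exact implementation): the image is scaled to fit the target size while preserving its aspect ratio, then padded to the exact dimensions:

from PIL import Image

def resize_and_pad(img: Image.Image, size: tuple, fill="white") -> Image.Image:
    # Scale to fit within `size` preserving aspect ratio, then pad to `size`.
    target_w, target_h = size
    scale = min(target_w / img.width, target_h / img.height)
    new_w, new_h = int(img.width * scale), int(img.height * scale)
    resized = img.resize((new_w, new_h), Image.LANCZOS)
    canvas = Image.new("RGB", (target_w, target_h), fill)
    # Center the resized image on the padded canvas.
    canvas.paste(resized, ((target_w - new_w) // 2, (target_h - new_h) // 2))
    return canvas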
When you generate a dataset using generate_training_data, the toolkit creates:
- images/ directory: Contains all generated images (PNG format by default).
- gt.txt: Ground truth file mapping each image filename to its corresponding text label.

Example gt.txt entries:
images/000001.png The quick brown fox
images/000002.png 13/01/2023
Each line follows the format <relative_path_to_image>\t<text_label> (tab-separated). Separate train/ and test/ directories are created, each with its own images and gt.txt file, according to the train_test_ratio.
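
Since the format is a plain tab-separated mapping, it is easy to consume in a training pipeline. Here is a small, hypothetical loader (not part of the toolkit) that reads gt.txt into (path, label) pairs:

from pathlib import Path

def load_ground_truth(gt_file: str) -> list:
    # Read gt.txt into (image_path, label) pairs; labels may contain spaces.
    samples = []
    for line in Path(gt_file).read_text(encoding="utf-8").splitlines():
        if line.strip():
            image_path, label = line.split("\t", 1)
            samples.append((image_path, label))
    return samples

samples = load_ground_truth("export/train/gt.txt")  # path depends on your output_save_path
print(samples[0])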
Generate a single image:

odt = ODT(language="en")
text, img = odt.generate_single_image()
img.save("test.png")

Generate a full training dataset:

odt = ODT(language="en", output_image_size=(128, 32), num_workers=4)
odt.generate_training_data(num_samples=1000)

Visualize the built-in font catalog:

odt.visualize_font_catalog(save_dir="font_catalog", chunk_size=10)
Take a look at example.py for example usage.
ocr_data_toolkit/
├── odt.py # Main toolkit interface (ODT class)
├── generators/
│ ├── en.py # English text-image generator (ENGenerator)
│ └── ... # Other language generators
├── helper/
│ ├── augmentation.py # Augmentation operations (Augmentation class)
│ └── utils.py # Utility functions (image, fonts, backgrounds)
├── common/
│ └── config.py # Config and AugmentationConfig classes
├── data/ # Sample data (backgrounds, fonts, vocabularies)
└── ...
Key Components:
- ODT class: Main entry point for data generation, configuration, and utilities.
- Augmentation class: Implements all augmentation methods (noise, blur, moire, distortion, etc.).
- generators.py: Contains text-image synthesis logic for different languages.
- utils.py: Helper functions for resizing, font selection, backgrounds, etc.
- config.py: Centralizes default and user configuration.

Extending the Toolkit:
- Add new language generators in generators/ and update config.
- Extend the Augmentation class with your custom method.
- Subclass ODT or ENGenerator for advanced use cases.

We welcome and encourage contributions of all kinds!
- To add a new language, create a generator in generators/ and update the config with the appropriate fonts and vocabulary.
- To add new augmentations, extend helper/augmentation.py (see the sketch below).
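
As a rough illustration of that extension point, the following sketch subclasses the Augmentation class. The import path and method name are hypothetical, so match them to the actual code in helper/augmentation.py:

import numpy as np
# Import path inferred from the project structure above; verify against the package.
from ocr_data_toolkit.helper.augmentation import Augmentation

class MyAugmentation(Augmentation):
    # Hypothetical subclass adding a salt-and-pepper noise augmentation.

    def salt_and_pepper(self, image: np.ndarray, amount: float = 0.01) -> np.ndarray:
        # Flip a small fraction of pixels to pure black (pepper) or white (salt).
        noisy = image.copy()
        mask = np.random.rand(*noisy.shape[:2])
        noisy[mask < amount / 2] = 0
        noisy[mask > 1 - amount / 2] = 255
        return noisy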
This project is licensed under the MIT License. See the LICENSE file for details.
For questions, suggestions, or support, please open an issue or contact the author at naumanhsa965@gmail.com.