German-Brazilian Newspapers (gbn)
This project aims at providing an OCR-D compliant toolset for optical layout recognition/analysis on images of historical german-language documents published in Brazil during the 19th and 20th centuries, focusing on periodical publications.
Table of contents
About
Although there is a considerable amount of digitized brazilian-published german-language periodicals available online (e.g. the dbp digital collection and the German-language periodicals section of the Brazilian (National) Digital Library), document image understanding of these prints is far from being optimal. While generic OCR solutions will work out of the box with typical everyday-life documents, it is a different story for historical newspapers like those due to several factors:
- Complex layouts (still a challenge for mainstream OCR toolsets e.g. ocropy and tesseract)
- Degradation over time (e.g. stains, rips, erased ink)
- Poor scanning quality (e.g. lighting contrast)
In order to achieve better full-text recognition results on the target documents, this project relies on two building blocks: The German-Brazilian Newspapers dataset and the ocrd-sbb-textline-detector tool. The first as a role-model for pioneering on layout analysis of german-brazilian documents (and also as a source of testing data) and the latter as a reference implementation of a robust layout analysis workflow for german-language documents. This project itself was forked from ocrd-sbb-textline-detector, aiming at replicating the original tool's functionality into several smaller modules and extending it for more powerful workflows.
Installation
pip3 install git+https://github.com/sulzbals/gbn.git
Usage
Refer to the OCR-D CLI documentation for instructions on running OCR-D tools.
Tools (gbn.sbb)
ocrd-gbn-sbb-predict
{
"executable": "ocrd-gbn-sbb-predict",
"categories": [
"Layout analysis"
],
"description": "Classifies pixels of input images given a binary (two classes) model and store the prediction as the specified PAGE-XML content type",
"steps": [
"layout/analysis"
],
"input_file_grp": [
"OCR-D-IMG",
"OCR-D-BIN"
],
"output_file_grp": [
"OCR-D-PREDICT"
],
"parameters": {
"model": {
"type": "string",
"description": "Path to Keras model to be used",
"required": true,
"cacheable": true
},
"shaping": {
"type": "string",
"description": "How the images must be processed in order to match the input shape of the model ('resize' for resizing to model shape and 'split' for splitting into patches)",
"required": true,
"enum": [
"resize",
"split"
]
},
"type": {
"type": "string",
"description": "PAGE-XML content type to be predicted",
"required": true,
"enum": [
"AlternativeImageType",
"BorderType",
"TextRegionType",
"TextLineType"
]
},
"operation_level": {
"type": "string",
"description": "PAGE-XML hierarchy level to operate on",
"default": "page",
"enum": [
"page",
"region",
"line"
]
}
}
}
ocrd-gbn-sbb-crop
{
"executable": "ocrd-gbn-sbb-crop",
"categories": [
"Image preprocessing",
"Layout analysis"
],
"description": "Crops the input page images by predicting the actual page surface and setting the PAGE-XML Border accordingly",
"steps": [
"preprocessing/optimization/cropping",
"layout/analysis"
],
"input_file_grp": [
"OCR-D-IMG"
],
"output_file_grp": [
"OCR-D-CROP"
],
"parameters": {
"model": {
"type": "string",
"description": "Path to Keras model to be used",
"required": true,
"cacheable": true
},
"shaping": {
"type": "string",
"description": "How the images must be processed in order to match the input shape of the model ('resize' for resizing to model shape and 'split' for splitting into patches)",
"default": "resize",
"enum": [
"resize",
"split"
]
}
}
}
ocrd-gbn-sbb-binarize
{
"executable": "ocrd-gbn-sbb-binarize",
"categories": [
"Image preprocessing",
"Layout analysis"
],
"description": "Binarizes the input page images by predicting their foreground pixels and saving it as a PAGE-XML AlternativeImage",
"steps": [
"preprocessing/optimization/binarization",
"layout/analysis"
],
"input_file_grp": [
"OCR-D-IMG"
],
"output_file_grp": [
"OCR-D-BIN"
],
"parameters": {
"model": {
"type": "string",
"description": "Path to Keras model to be used",
"required": true,
"cacheable": true
},
"shaping": {
"type": "string",
"description": "How the images must be processed in order to match the input shape of the model ('resize' for resizing to model shape and 'split' for splitting into patches)",
"default": "split",
"enum": [
"resize",
"split"
]
},
"operation_level": {
"type": "string",
"description": "PAGE-XML hierarchy level to operate on",
"default": "page",
"enum": [
"page",
"region",
"line"
]
}
}
}
ocrd-gbn-sbb-segment
{
"executable": "ocrd-gbn-sbb-segment",
"categories": [
"Layout analysis"
],
"description": "Segments the input page images by predicting the text regions and lines and setting the PAGE-XML TextRegion and TextLine accordingly",
"steps": [
"layout/segmentation/region",
"layout/segmentation/line"
],
"input_file_grp": [
"OCR-D-DESKEW"
],
"output_file_grp": [
"OCR-D-SEG"
],
"parameters": {
"region_model": {
"type": "string",
"description": "Path to Keras model to be used for predicting text regions",
"default": "",
"cacheable": true
},
"region_shaping": {
"type": "string",
"description": "How the images must be processed in order to match the input shape of the model ('resize' for resizing to model shape and 'split' for splitting into patches)",
"default": "split",
"enum": [
"resize",
"split"
]
},
"line_model": {
"type": "string",
"description": "Path to Keras model to be used for predicting text lines",
"required": true,
"cacheable": true
},
"line_shaping": {
"type": "string",
"description": "How the images must be processed in order to match the input shape of the model ('resize' for resizing to model shape and 'split' for splitting into patches)",
"default": "split",
"enum": [
"resize",
"split"
]
}
}
}
Library (gbn.lib)
This small library provides an abstraction layer that the OCR-D processors contained in this project should use for performing common image processing and deep learning routines. Those processors therefore should not directly access libraries like OpenCV, Numpy or Keras.
Check the source code files for detailed documentation on each class and function of the library.
Models
Currently the models being used are the ones provided by the qurator team. Models for binarization can be found here and for cropping and segmentation here.
There are plans for extending the GBN dataset with more degraded document pages as an attempt to train robust models in the near future.
Recommended Workflow
The most generic and simple processing step implementations of ocrd-sbb-textline-detector were not implemented since there are already tools that do effectively the same. The resizing to 2800 pixels of height is performed through an imagemagick wrapper for OCR-D (ocrd-im6convert) and the deskewing through an ocropy wrapper (ocrd-cis-ocropy).
Step | Processor | Parameters |
---|
1 | ocrd-im6convert | { "output-format": "image/png", "output-options": "-geometry x2800" } |
2 | ocrd-gbn-sbb-crop | { "model": "/path/to/model_page_mixed_best.h5", "shaping": "resize" } |
3 | ocrd-gbn-sbb-binarize | { "model": "/path/to/model_bin4.h5", "shaping": "split", "operation_level": "page" } |
4 | ocrd-cis-ocropy-deskew | { "level-of-operation": "page" } |
5 | ocrd-gbn-sbb-segment | { "region_model": "/path/to/model_strukturerkennung.h5", "region_shaping": "split", "line_model": "/path/to/model_textline_new.h5", "line_shaping": "split" } |