
ocrd-gbn


Collection of OCR-D compliant tools for layout analysis and segmentation of historical German-language documents published in Brazil

1.0.0

PyPI

Maintainers: 1

German-Brazilian Newspapers (gbn)

This project aims to provide an OCR-D compliant toolset for optical layout recognition and analysis of images of historical German-language documents published in Brazil during the 19th and 20th centuries, with a focus on periodicals.

About

Although a considerable number of digitized German-language periodicals published in Brazil are available online (e.g. the dbp digital collection and the German-language periodicals section of the Brazilian (National) Digital Library), document image understanding of these prints is far from optimal. Generic OCR solutions work out of the box on typical everyday documents, but historical newspapers like these are harder to process for several reasons:

Complex layouts (still a challenge for mainstream OCR toolsets e.g. ocropy and tesseract)
Degradation over time (e.g. stains, rips, erased ink)
Poor scanning quality (e.g. lighting contrast)

To achieve better full-text recognition results on the target documents, this project relies on two building blocks: the German-Brazilian Newspapers dataset and the ocrd-sbb-textline-detector tool. The former serves as a role model for pioneering layout analysis of German-Brazilian documents (and as a source of test data); the latter serves as a reference implementation of a robust layout analysis workflow for German-language documents. This project was forked from ocrd-sbb-textline-detector, with the aim of splitting the original tool's functionality into several smaller modules and extending it for more powerful workflows.

Installation

pip3 install git+https://github.com/sulzbals/gbn.git

Usage

Refer to the OCR-D CLI documentation for instructions on running OCR-D tools.

Tools (gbn.sbb)

ocrd-gbn-sbb-predict

{
 "executable": "ocrd-gbn-sbb-predict",
 "categories": [
  "Layout analysis"
 ],
 "description": "Classifies pixels of input images given a binary (two classes) model and store the prediction as the specified PAGE-XML content type",
 "steps": [
  "layout/analysis"
 ],
 "input_file_grp": [
  "OCR-D-IMG",
  "OCR-D-BIN"
 ],
 "output_file_grp": [
  "OCR-D-PREDICT"
 ],
 "parameters": {
  "model": {
   "type": "string",
   "description": "Path to Keras model to be used",
   "required": true,
   "cacheable": true
  },
  "shaping": {
   "type": "string",
   "description": "How the images must be processed in order to match the input shape of the model ('resize' for resizing to model shape and 'split' for splitting into patches)",
   "required": true,
   "enum": [
    "resize",
    "split"
   ]
  },
  "type": {
   "type": "string",
   "description": "PAGE-XML content type to be predicted",
   "required": true,
   "enum": [
    "AlternativeImageType",
    "BorderType",
    "TextRegionType",
    "TextLineType"
   ]
  },
  "operation_level": {
   "type": "string",
   "description": "PAGE-XML hierarchy level to operate on",
   "default": "page",
   "enum": [
    "page",
    "region",
    "line"
   ]
  }
 }
}
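The parameter section above behaves like a small schema: some parameters are required, some have defaults, and some are restricted to an enumerated set of values. The sketch below illustrates how such a parameter set could be checked before invoking the tool. It is only an illustration: `resolve_parameters` is a hypothetical helper and not part of gbn or OCR-D, the model path is a placeholder, and OCR-D performs its own parameter validation.

```python
import json

# Parameter section copied from the ocrd-gbn-sbb-predict description above
# (the "description" and "cacheable" fields are omitted for brevity).
PREDICT_PARAMETERS = json.loads("""
{
  "model": {"type": "string", "required": true},
  "shaping": {"type": "string", "required": true,
              "enum": ["resize", "split"]},
  "type": {"type": "string", "required": true,
           "enum": ["AlternativeImageType", "BorderType",
                    "TextRegionType", "TextLineType"]},
  "operation_level": {"type": "string", "default": "page",
                      "enum": ["page", "region", "line"]}
}
""")


def resolve_parameters(spec, given):
    """Hypothetical helper: apply defaults and check required/enum rules."""
    unknown = set(given) - set(spec)
    if unknown:
        raise ValueError(f"unknown parameters: {sorted(unknown)}")
    resolved = {}
    for name, rules in spec.items():
        if name in given:
            value = given[name]
        elif "default" in rules:
            value = rules["default"]
        elif rules.get("required", False):
            raise ValueError(f"missing required parameter: {name}")
        else:
            continue  # optional parameter without a default
        if "enum" in rules and value not in rules["enum"]:
            raise ValueError(
                f"{name} must be one of {rules['enum']}, got {value!r}"
            )
        resolved[name] = value
    return resolved


# Placeholder model path; "operation_level" is filled in from its default.
params = resolve_parameters(
    PREDICT_PARAMETERS,
    {"model": "/path/to/model.h5", "shaping": "split", "type": "TextLineType"},
)
print(json.dumps(params))
```

The resulting dictionary can be serialized with `json.dumps` and passed to the processor as its parameter JSON.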

ocrd-gbn-sbb-crop

{
 "executable": "ocrd-gbn-sbb-crop",
 "categories": [
  "Image preprocessing",
  "Layout analysis"
 ],
 "description": "Crops the input page images by predicting the actual page surface and setting the PAGE-XML Border accordingly",
 "steps": [
  "preprocessing/optimization/cropping",
  "layout/analysis"
 ],
 "input_file_grp": [
  "OCR-D-IMG"
 ],
 "output_file_grp": [
  "OCR-D-CROP"
 ],
 "parameters": {
  "model": {
   "type": "string",
   "description": "Path to Keras model to be used",
   "required": true,
   "cacheable": true
  },
  "shaping": {
   "type": "string",
   "description": "How the images must be processed in order to match the input shape of the model ('resize' for resizing to model shape and 'split' for splitting into patches)",
   "default": "resize",
   "enum": [
    "resize",
    "split"
   ]
  }
 }
}

ocrd-gbn-sbb-binarize

{
 "executable": "ocrd-gbn-sbb-binarize",
 "categories": [
  "Image preprocessing",
  "Layout analysis"
 ],
 "description": "Binarizes the input page images by predicting their foreground pixels and saving it as a PAGE-XML AlternativeImage",
 "steps": [
  "preprocessing/optimization/binarization",
  "layout/analysis"
 ],
 "input_file_grp": [
  "OCR-D-IMG"
 ],
 "output_file_grp": [
  "OCR-D-BIN"
 ],
 "parameters": {
  "model": {
   "type": "string",
   "description": "Path to Keras model to be used",
   "required": true,
   "cacheable": true
  },
  "shaping": {
   "type": "string",
   "description": "How the images must be processed in order to match the input shape of the model ('resize' for resizing to model shape and 'split' for splitting into patches)",
   "default": "split",
   "enum": [
    "resize",
    "split"
   ]
  },
  "operation_level": {
   "type": "string",
   "description": "PAGE-XML hierarchy level to operate on",
   "default": "page",
   "enum": [
    "page",
    "region",
    "line"
   ]
  }
 }
}
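The "shaping" parameter chooses between resizing the whole image to the model's input shape and splitting it into patches of that shape. As a rough illustration of the "split" idea, the sketch below computes patch coordinates that cover an image. It is not the implementation used in gbn.lib: the function name `patch_boxes`, the 448x448 patch size, and the way the last patch of each row and column is shifted back to stay inside the image are all assumptions made for this example.

```python
def patch_boxes(image_h, image_w, patch_h, patch_w):
    """Return (top, left, bottom, right) boxes of patch size covering the image.

    Assumes the image is at least as large as one patch. The last patch of
    each row/column is shifted back so that it stays inside the image, which
    makes it overlap its neighbour (an illustrative choice, not gbn's).
    """
    if image_h < patch_h or image_w < patch_w:
        raise ValueError("image smaller than patch; pad it or use 'resize'")

    def starts(length, size):
        # Regular grid positions, plus one shifted-back position if the
        # grid leaves a remainder at the end uncovered.
        positions = list(range(0, length - size + 1, size))
        if positions[-1] + size < length:
            positions.append(length - size)
        return positions

    return [
        (top, left, top + patch_h, left + patch_w)
        for top in starts(image_h, patch_h)
        for left in starts(image_w, patch_w)
    ]


# Example: a page 2800 pixels high and 2000 pixels wide, with an assumed
# 448x448 model input shape.
boxes = patch_boxes(image_h=2800, image_w=2000, patch_h=448, patch_w=448)
print(len(boxes), boxes[-1])
```

Each box would be cropped from the image, passed through the model, and the per-patch predictions written back into a full-size output at the same coordinates. How overlapping areas are merged is a separate design decision not shown here.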

ocrd-gbn-sbb-segment

{
 "executable": "ocrd-gbn-sbb-segment",
 "categories": [
  "Layout analysis"
 ],
 "description": "Segments the input page images by predicting the text regions and lines and setting the PAGE-XML TextRegion and TextLine accordingly",
 "steps": [
  "layout/segmentation/region",
  "layout/segmentation/line"
 ],
 "input_file_grp": [
  "OCR-D-DESKEW"
 ],
 "output_file_grp": [
  "OCR-D-SEG"
 ],
 "parameters": {
  "region_model": {
   "type": "string",
   "description": "Path to Keras model to be used for predicting text regions",
   "default": "",
   "cacheable": true
  },
  "region_shaping": {
   "type": "string",
   "description": "How the images must be processed in order to match the input shape of the model ('resize' for resizing to model shape and 'split' for splitting into patches)",
   "default": "split",
   "enum": [
    "resize",
    "split"
   ]
  },
  "line_model": {
   "type": "string",
   "description": "Path to Keras model to be used for predicting text lines",
   "required": true,
   "cacheable": true
  },
  "line_shaping": {
   "type": "string",
   "description": "How the images must be processed in order to match the input shape of the model ('resize' for resizing to model shape and 'split' for splitting into patches)",
   "default": "split",
   "enum": [
    "resize",
    "split"
   ]
  }
 }
}

Library (gbn.lib)

This small library provides an abstraction layer that the OCR-D processors in this project should use for common image processing and deep learning routines. The processors should therefore not access libraries like OpenCV, NumPy or Keras directly.

Check the source code files for detailed documentation on each class and function of the library.

Models

Currently the models being used are the ones provided by the qurator team. Models for binarization can be found here and for cropping and segmentation here.

There are plans to extend the GBN dataset with more degraded document pages in the near future, in an attempt to train more robust models.

Recommended Workflow

The simplest and most generic processing steps of ocrd-sbb-textline-detector were not reimplemented here, since existing tools already do effectively the same thing. Resizing to a height of 2800 pixels is performed through an ImageMagick wrapper for OCR-D (ocrd-im6convert), and deskewing through an ocropy wrapper (ocrd-cis-ocropy).

Step	Processor	Parameters
1	ocrd-im6convert	{ "output-format": "image/png", "output-options": "-geometry x2800" }
2	ocrd-gbn-sbb-crop	{ "model": "/path/to/model_page_mixed_best.h5", "shaping": "resize" }
3	ocrd-gbn-sbb-binarize	{ "model": "/path/to/model_bin4.h5", "shaping": "split", "operation_level": "page" }
4	ocrd-cis-ocropy-deskew	{ "level-of-operation": "page" }
5	ocrd-gbn-sbb-segment	{ "region_model": "/path/to/model_strukturerkennung.h5", "region_shaping": "split", "line_model": "/path/to/model_textline_new.h5", "line_shaping": "split" }
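The table above can be turned into a sequence of command lines in which each step reads the file group written by the previous one. The sketch below only builds and prints the commands, it does not run them. It uses the standard OCR-D processor options `-I` (input file group), `-O` (output file group) and `-p` (parameter JSON). The intermediate file group names, in particular `OCR-D-PNG`, are assumptions chosen for this example, and the model paths are the placeholders from the table.

```python
import json
import shlex

# (processor, parameters) pairs taken from the workflow table above.
WORKFLOW = [
    ("ocrd-im6convert",
     {"output-format": "image/png", "output-options": "-geometry x2800"}),
    ("ocrd-gbn-sbb-crop",
     {"model": "/path/to/model_page_mixed_best.h5", "shaping": "resize"}),
    ("ocrd-gbn-sbb-binarize",
     {"model": "/path/to/model_bin4.h5", "shaping": "split",
      "operation_level": "page"}),
    ("ocrd-cis-ocropy-deskew",
     {"level-of-operation": "page"}),
    ("ocrd-gbn-sbb-segment",
     {"region_model": "/path/to/model_strukturerkennung.h5",
      "region_shaping": "split",
      "line_model": "/path/to/model_textline_new.h5",
      "line_shaping": "split"}),
]

# Assumed file group names: the first is the input, and each following one
# is the output of the corresponding step.
GROUPS = ["OCR-D-IMG", "OCR-D-PNG", "OCR-D-CROP",
          "OCR-D-BIN", "OCR-D-DESKEW", "OCR-D-SEG"]


def build_commands(workflow, groups):
    """Chain the steps so each one reads the previous step's output group."""
    if len(groups) != len(workflow) + 1:
        raise ValueError("need exactly one more file group than steps")
    commands = []
    for (processor, params), src, dst in zip(workflow, groups, groups[1:]):
        argv = [processor, "-I", src, "-O", dst, "-p", json.dumps(params)]
        commands.append(shlex.join(argv))  # shell-quoted, copy-pasteable
    return commands


for command in build_commands(WORKFLOW, GROUPS):
    print(command)
```

The printed lines are meant to be run inside an OCR-D workspace. Refer to the OCR-D CLI documentation for workspace setup and for alternative ways of chaining processors.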

Keywords

OCR

OCR-D
