šŸš€ Big News: Socket Acquires Coana to Bring Reachability Analysis to Every Appsec Team.Learn more →
Socket
Sign inDemoInstall
Socket

ocrd-page-to-alto

Package Overview
Dependencies
Maintainers
2
Alerts
File Explorer

Advanced tools

Socket logo

Install Socket

Detect and block malicious and high-risk dependencies

Install

ocrd-page-to-alto

Convert PAGE (v. 2019) to ALTO (v. 2.0 - 4.2)

2.1.0
PyPI
Maintainers
2

ocrd-page-to-alto

Convert PAGE (v. 2019) to ALTO (v. 2.0 - 4.2)

CircleCI

Introduction

This software converts PAGE XML files to the ALTO XML OCR result format. It enables using PAGE XML generating software in a context where ALTO is needed to display the results, i.e. in libraries.

Installation

In a Python virtualenv:

make install     # or pip install .
# or to install from PyPI
pip install ocrd_page_to_alto

Usage

To convert the PAGE XML document example.xml to ALTO:

page-to-alto example.xml > example.alto.xml

You can get an exhaustive list of page-to-alto's many options with --help:

CLI

Usage: page-to-alto [OPTIONS] FILENAME
  Convert PAGE to ALTO
Options:
  -l, --log-level [OFF|ERROR|WARN|INFO|DEBUG|TRACE]
                                  Log level
  --alto-version [4.2|4.1|4.0|3.1|3.0|2.1|2.0]
                                  Choose version of ALTO-XML schema to produce
                                  (older versions may not preserve all
                                  features)
  --check-words / --no-check-words
                                  Check whether PAGE-XML contains any Words
                                  and fail if not
  --check-border / --no-check-border
                                  Check whether PAGE-XML contains Border or
                                  PrintSpace
  --skip-empty-lines / --no-skip-empty-lines
                                  Whether to omit or keep empty lines in PAGE-
                                  XML
  --trailing-dash-to-hyp / --no-trailing-dash-to-hyp
                                  Whether to add a  element if the last
                                  word in a line ends in "-"
  --dummy-textline / --no-dummy-textline
                                  Whether to create a TextLine for regions
                                  that have TextEquiv/Unicode but no TextLine
  --dummy-word / --no-dummy-word  Whether to create a Word for TextLine that
                                  have TextEquiv/Unicode but no Word
  --textequiv-index INTEGER       If multiple textequiv, use the n-th
                                  TextEquiv by @index
  --textequiv-fallback-strategy [raise|first|last]
                                  What to do if nth textequiv isn't available.
                                  'raise' will lead to a runtime error,
                                  'first' will use the first TextEquiv, 'last'
                                  will use the last TextEquiv on the element
  -O, --output-file FILE          Output filename (or "-" for standard output,
                                  the default)
  -h, --help                      Show this message and exit.

To process an OCR-D workspace, use ocrd_fileformat, which uses page-to-alto by default:

ocrd-fileformat-transform -I OCRD-OCR-OUTPUT-PAGE -O OCRD-OCR-OUTPUT-ALTO \
  -P script-args "--dummy-word --no-check-words --no-check-border"

TODO

  • AlternativeImage
  • unmappable regions
  • handle Border
  • TextStyle
  • ParagraphStyle
  • table regions
  • recursive regions for *Region
  • Set PAGECLASS from pc:Page/@type #4
  • Layers / z-level via StructureTag? #4
  • <SP/>
  • <HYP/>
  • rotation
  • reading order
  • input PAGE-XML not having words #5
  • multiple pc:TextEquivs
  • language
  • script no equivalent in ALTO :(
  • kerning no equivalent in ALTO :(
  • underlineStyle no equivalent in ALTO :(
  • bgColour no equivalent in ALTO :(
  • bgColourRgb no equivalent in ALTO :(
  • reverseVideo no equivalent in ALTO :(
  • xHeight no equivalent in ALTO :(
  • letterSpaced no equivalent in ALTO :(
  • ProcessingStep
  • differentiate/select ALTO versions

FAQs

Did you know?

Socket

Socket for GitHub automatically highlights issues in each pull request and monitors the health of all your open source dependencies. Discover the contents of your packages and block harmful activity before you install or update your dependencies.

Install

Related posts