
Crop, deskew, segment into regions / tables / lines / words, or recognize with tesserocr
This package offers OCR-D compliant workspace processors for (much of) the functionality of Tesseract via its Python API wrapper tesserocr. (Each processor is a parameterizable step in a configurable workflow of the OCR-D functional model. There are usually various alternative processor implementations for each step. Data is represented with METS and PAGE.)
It includes image preprocessing (cropping, binarization, deskewing), layout analysis (region, table, line, word segmentation), script identification, font style recognition and text recognition.
Most processors can operate on different levels of the PAGE hierarchy, depending on the workflow configuration. In PAGE, image results are referenced (read and written) via AlternativeImage, text results via TextEquiv, font attributes via TextStyle, script via @primaryScript, deskewing via @orientation, cropping via Border and segmentation via Region / TextLine / Word elements with Coords/@points.
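As a small illustration of how segment outlines are stored, the sketch below parses a Coords/@points value (a space-separated list of x,y pairs) into a polygon and derives its bounding box. The helper names parse_points and bounding_box are hypothetical and not part of ocrd_tesserocr or the OCR-D API; the sample coordinates are made up.

```python
from typing import List, Tuple

Point = Tuple[int, int]


def parse_points(points: str) -> List[Point]:
    """Parse a PAGE Coords/@points string such as "10,20 110,20 110,70"."""
    polygon = []
    for pair in points.split():
        x, y = pair.split(",")
        polygon.append((int(x), int(y)))
    return polygon


def bounding_box(polygon: List[Point]) -> Tuple[int, int, int, int]:
    """Return (x_min, y_min, x_max, y_max) of the polygon."""
    xs = [x for x, _ in polygon]
    ys = [y for _, y in polygon]
    return min(xs), min(ys), max(xs), max(ys)


# Made-up outline of a rectangular text line.
polygon = parse_points("10,20 110,20 110,70 10,70")
print(bounding_box(polygon))
```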
This is the best option if you want to run the software in a container.
You need to have Docker installed.
docker pull ocrd/tesserocr
To run with docker:
docker run -v path/to/workspaces:/data ocrd/tesserocr ocrd-tesserocr-crop ...
If your operating system / distribution already provides Tesseract 4.1 or newer, then just install its development package:
# on Debian / Ubuntu:
sudo apt install libtesseract-dev
Otherwise, recent Tesseract packages for Ubuntu are available via PPA alex-p, which has up-to-date builds of Tesseract and its dependencies:
# on Debian / Ubuntu
sudo add-apt-repository ppa:alex-p/tesseract-ocr
sudo apt-get update
sudo apt install libtesseract-dev
Once Tesseract is available, install ocrd_tesserocr from PyPI:
pip install ocrd_tesserocr
We strongly recommend setting up a venv first.
Use this option if there is no suitable prebuilt version of Tesseract available on your system, or you want to change the source code or install the latest, unpublished changes.
git clone https://github.com/OCR-D/ocrd_tesserocr
cd ocrd_tesserocr
# install Tesseract:
sudo make deps-ubuntu # system dependencies just for the build
make deps
# install tesserocr and ocrd_tesserocr:
make install
We strongly recommend setting up a venv first.
Tesseract comes with synthetically trained models for languages (tesseract-ocr-{eng,deu,deu_latf,...})
or scripts (tesseract-ocr-script-{latn,frak,...}). In addition, various models
trained on scan data are available from the community.
Since all OCR-D processors must resolve file/data resources
in a standardized way,
and we want to stay interoperable with standalone Tesseract
(which uses a single compile-time tessdata directory),
ocrd-tesserocr-recognize expects the recognition models to be installed
in its module resource location only.
The module location is determined by the underlying Tesseract installation
(compile-time tessdata directory, or run-time $TESSDATA_PREFIX environment variable).
Other resource locations (data/system/cwd) will be ignored, and should not be used
when installing models with the Resource Manager (ocrd resmgr download).
To see the module resource location of your installation:
ocrd-tesserocr-recognize -D
For a full description of available commands for resource management, see:
ocrd resmgr --help
ocrd resmgr list-available --help
ocrd resmgr download --help
ocrd resmgr list-installed --help
Note: (In previous versions, the resource locations of standalone Tesseract and the OCR-D wrapper were different. If you already have models under
$XDG_DATA_HOME/ocrd-resources/ocrd-tesserocr-recognize, usually ~/.local/share/ocrd-resources/ocrd-tesserocr-recognize, then consider moving them to the new default under ocrd-tesserocr-recognize -D, usually /usr/share/tesseract-ocr/4.00/tessdata, or alternatively overriding the module directory by setting TESSDATA_PREFIX=$XDG_DATA_HOME/ocrd-resources/ocrd-tesserocr-recognize in the environment.)
Cf. OCR-D model guide.
Models always use the filename suffix .traineddata, but are just loaded by their basename.
You will need at least eng and osd installed (even for segmentation and deskewing),
probably also Latin and Fraktur etc. So to get minimal models, do:
ocrd resmgr download ocrd-tesserocr-recognize eng.traineddata
ocrd resmgr download ocrd-tesserocr-recognize osd.traineddata
(These will already be installed if you use the Docker or git installation option.)
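To check quickly whether the minimal models are present, a sketch along the following lines may help. It lists the *.traineddata files directly inside a directory, strips the suffix to get the basenames, and reports which required models are missing. The function names are hypothetical, the demonstration uses a throwaway directory instead of a real tessdata directory, and models in subdirectories are not considered.

```python
import tempfile
from pathlib import Path
from typing import Iterable, Set

SUFFIX = ".traineddata"


def installed_models(tessdata_dir: Path) -> Set[str]:
    """Return the basenames of all *.traineddata files in the directory."""
    return {p.name[: -len(SUFFIX)] for p in tessdata_dir.glob("*" + SUFFIX)}


def missing_models(tessdata_dir: Path,
                   required: Iterable[str] = ("eng", "osd")) -> Set[str]:
    """Return the required model basenames that are not installed."""
    return set(required) - installed_models(tessdata_dir)


# Demonstration on a temporary directory with hypothetical contents.
with tempfile.TemporaryDirectory() as tmp:
    tessdata = Path(tmp)
    (tessdata / "eng.traineddata").touch()
    (tessdata / "deu_latf.traineddata").touch()
    print(sorted(installed_models(tessdata)))
    print(sorted(missing_models(tessdata)))
```

In practice, you would pass the directory reported by ocrd-tesserocr-recognize -D instead of the temporary one.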
As of v0.13.1, you can configure ocrd-tesserocr-recognize to select models dynamically segment by segment,
either via custom conditions on the PAGE-XML annotation (presented as XPath rules),
or by automatically choosing the model with highest confidence.
For details, see docstrings in the individual processors
and ocrd-tool.json descriptions,
or simply --help.
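The idea of choosing the model with the highest confidence can be pictured with the following sketch. It is only an illustration of the selection principle, not the processor's actual implementation: the Candidate type, the pick_best function, the confidence scale and the sample results are all assumptions made for this example.

```python
from typing import Dict, NamedTuple


class Candidate(NamedTuple):
    text: str
    confidence: float  # assumed scale: 0.0 (worst) to 1.0 (best)


def pick_best(candidates: Dict[str, Candidate]) -> str:
    """Return the name of the model whose result has the highest confidence."""
    if not candidates:
        raise ValueError("no candidate models given")
    return max(candidates, key=lambda model: candidates[model].confidence)


# Hypothetical results for one text line, recognized with two models.
line_results = {
    "eng": Candidate("Die Bucher", 0.71),
    "deu_latf": Candidate("Die Bücher", 0.93),
}
print(pick_best(line_results))
```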
Available OCR-D processors are:
- ocrd-tesserocr-crop
  - sets Border of pages and adds AlternativeImage files to the output fileGrp
- ocrd-tesserocr-deskew (for skew and orientation; mind operation_level)
  - sets @orientation of regions or pages and adds AlternativeImage files to the output fileGrp
- ocrd-tesserocr-binarize (Otsu; not recommended, unless already binarized and using tiseg)
  - adds AlternativeImage files to the output fileGrp
- ocrd-tesserocr-recognize (optionally including segmentation; mind segmentation_level and textequiv_level)
  - adds TextRegions, TableRegions, ImageRegions, MathsRegions, SeparatorRegions, NoiseRegions, ReadingOrder and AlternativeImage to Page and sets their @orientation (optionally)
  - adds TextRegions to TableRegions and sets their @orientation (optionally)
  - adds TextLines to TextRegions (optionally)
  - adds Words to TextLines (optionally)
  - adds Glyphs to Words (optionally)
  - adds TextEquiv
- ocrd-tesserocr-segment (all-in-one segmentation; delegates to recognize)
  - adds TextRegions, TableRegions, ImageRegions, MathsRegions, SeparatorRegions, NoiseRegions, ReadingOrder and AlternativeImage to Page and sets their @orientation
  - adds TextRegions to TableRegions and sets their @orientation
  - adds TextLines to TextRegions
  - adds Words to TextLines
  - adds Glyphs to Words
- ocrd-tesserocr-segment-region (only regions; delegates to recognize)
  - adds TextRegions, TableRegions, ImageRegions, MathsRegions, SeparatorRegions, NoiseRegions and ReadingOrder to Page and sets their @orientation
- ocrd-tesserocr-segment-table (only table cells; delegates to recognize)
  - adds TextRegions to TableRegions
- ocrd-tesserocr-segment-line (only lines; delegates to recognize)
  - adds TextLines to TextRegions
- ocrd-tesserocr-segment-word (only words; delegates to recognize)
  - adds Words to TextLines
- ocrd-tesserocr-fontshape (only text style)
  - adds TextStyle to Words
The text region @types detected are (from Tesseract's PolyBlockType):
- paragraph: normal block (aligned with others in the column)
- floating: unaligned block (is in a cross-column pull-out region)
- heading: block that spans more than one column
- caption: block for text that belongs to an image
If you are unhappy with these choices, then consider post-processing
with a dedicated custom processor in Python, or by modifying the PAGE files directly
(e.g. xmlstarlet ed --inplace -u '//pc:TextRegion/@type[.="floating"]' -v paragraph filegrp/*.xml).
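For the Python route, the same relabelling as in the xmlstarlet one-liner can be sketched with the standard library alone. This is a minimal sketch, not an OCR-D processor: retype_regions and SAMPLE are hypothetical names, and the code assumes the PAGE 2019-07-15 namespace, so files using another PAGE schema version would need a different PAGE_NS.

```python
import xml.etree.ElementTree as ET

# Assumption: the files use the PAGE 2019-07-15 schema namespace.
PAGE_NS = "http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15"


def retype_regions(page_xml: str, old: str = "floating",
                   new: str = "paragraph") -> str:
    """Change TextRegion/@type from `old` to `new` and return the new XML."""
    ET.register_namespace("pc", PAGE_NS)  # keep a readable prefix on output
    root = ET.fromstring(page_xml)
    for region in root.iter("{%s}TextRegion" % PAGE_NS):
        if region.get("type") == old:
            region.set("type", new)
    return ET.tostring(root, encoding="unicode")


# Made-up, heavily reduced PAGE document for demonstration.
SAMPLE = """<pc:PcGts xmlns:pc="http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15">
  <pc:Page imageFilename="page.tif" imageWidth="100" imageHeight="100">
    <pc:TextRegion id="r1" type="floating"/>
    <pc:TextRegion id="r2" type="heading"/>
  </pc:Page>
</pc:PcGts>"""

print(retype_regions(SAMPLE))
```

Only regions whose type equals the old value are touched; all other regions pass through unchanged.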
All segmentation is currently done as bounding boxes only by default, i.e. without precise polygonal outlines. For dense page layouts this means that neighbouring regions and neighbouring text lines may overlap a lot. If this is a problem for your workflow, try post-processing like so:
- after line segmentation: use ocrd-cis-ocropy-resegment for polygonalization,
  or ocrd-cis-ocropy-clip on the line level
- after region segmentation: use ocrd-segment-repair with plausibilize
  (and sanitize after line segmentation)
It also means that Tesseract should be allowed to segment across multiple hierarchy levels at once, to avoid introducing inconsistent/duplicate text line assignments in text regions, or word assignments in text lines. Hence,
- prefer ocrd-tesserocr-recognize with segmentation_level=region
  over ocrd-tesserocr-segment followed by ocrd-tesserocr-recognize,
  if you want to do everything in one step with Tesseract,
- prefer ocrd-tesserocr-recognize with segmentation_level=line
  over ocrd-tesserocr-segment-line followed by ocrd-tesserocr-recognize,
  if you want to do everything but region segmentation with Tesseract,
- prefer ocrd-tesserocr-segment over ocrd-tesserocr-segment-region
  (and ocrd-tesserocr-segment-table and) ocrd-tesserocr-segment-line,
  if you want to do everything but recognition with Tesseract.
However, you can also run ocrd-tesserocr-segment* and ocrd-tesserocr-recognize
with shrink_polygons=True to get polygons by post-processing each segment,
shrinking to the convex hull of all its symbol outlines.
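To see why a convex hull is tighter than a bounding box, the sketch below computes both for a made-up set of symbol outline points. It illustrates the geometric idea only and is not the code used by the processors: it uses Andrew's monotone chain algorithm, and convex_hull, polygon_area and the sample points are assumptions for this example.

```python
from typing import List, Tuple

Point = Tuple[int, int]


def cross(o: Point, a: Point, b: Point) -> int:
    """Z component of the cross product of vectors o->a and o->b."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull(points: List[Point]) -> List[Point]:
    """Convex hull via Andrew's monotone chain, without collinear points."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower: List[Point] = []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    upper: List[Point] = []
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    # The last point of each chain is the first point of the other one.
    return lower[:-1] + upper[:-1]


def polygon_area(polygon: List[Point]) -> float:
    """Area of a simple polygon (shoelace formula)."""
    total = 0
    for i, (x1, y1) in enumerate(polygon):
        x2, y2 = polygon[(i + 1) % len(polygon)]
        total += x1 * y2 - x2 * y1
    return abs(total) / 2


# Made-up outlines of two symbols on a slanted text line.
symbols = [(0, 10), (10, 10), (10, 20), (0, 20),
           (40, 0), (50, 0), (50, 10), (40, 10)]
xs = [x for x, _ in symbols]
ys = [y for _, y in symbols]
bbox_area = (max(xs) - min(xs)) * (max(ys) - min(ys))
hull = convex_hull(symbols)
print("bounding box area:", bbox_area)
print("convex hull area:", polygon_area(hull))
```

For slanted or ragged segments the hull leaves out the empty corners of the bounding box, which reduces overlap with neighbouring segments.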
make test
This downloads some test data from https://github.com/OCR-D/assets under repo/assets,
and runs some basic tests of the Python API as well as the CLIs.
Set PYTEST_ARGS="-s --verbose" to see log output (-s) and individual test results (--verbose).