docstruct

A package for representing documents as a tree of documents, pages, paragraphs, lines, words, and characters.

Version 1.0.251 (PyPI)

Maintainers: 2

Docstruct

Overview

Docstruct parses the output of optical character recognition (OCR) engines, such as Tesseract (via its hOCR output) or AWS Textract, into a tree structure. Each node in the tree represents a document, page, paragraph, line, word, or character, and carries its bounding box, so the tree preserves the visual layout of the document. The package also supports paragraph detection and text splitting that preserves logical units.
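To make the idea of a document tree with bounding boxes concrete, here is a minimal, self-contained sketch. It is not Docstruct's API: the `BoundingBox` and `Node` classes, the `kind` labels, and the `get_text` and `compute_bbox` methods are hypothetical names chosen for illustration only. The sketch assumes words are the leaves and that a parent's box is the union of its children's boxes.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class BoundingBox:
    """Axis-aligned box from (left, top) to (right, bottom)."""
    left: float
    top: float
    right: float
    bottom: float

    def union(self, other: BoundingBox) -> BoundingBox:
        """Smallest box that contains both boxes."""
        return BoundingBox(
            min(self.left, other.left),
            min(self.top, other.top),
            max(self.right, other.right),
            max(self.bottom, other.bottom),
        )


# Separator used when joining the text of a node's children (an assumption).
SEPARATORS = {"document": "\n\n", "page": "\n\n", "paragraph": "\n", "line": " "}


@dataclass
class Node:
    """Hypothetical tree node; `kind` is e.g. 'paragraph', 'line', or 'word'."""
    kind: str
    bbox: BoundingBox | None = None
    text: str = ""
    children: list[Node] = field(default_factory=list)

    def get_text(self) -> str:
        """Leaf nodes return their own text; inner nodes join their children."""
        if not self.children:
            return self.text
        sep = SEPARATORS.get(self.kind, "")
        return sep.join(child.get_text() for child in self.children)

    def compute_bbox(self) -> BoundingBox:
        """Derive an inner node's box as the union of its children's boxes."""
        if not self.children:
            if self.bbox is None:
                raise ValueError("a leaf node needs a bounding box")
            return self.bbox
        boxes = [child.compute_bbox() for child in self.children]
        merged = boxes[0]
        for box in boxes[1:]:
            merged = merged.union(box)
        self.bbox = merged
        return merged


def word(text: str, left: float, top: float, right: float, bottom: float) -> Node:
    """Convenience constructor for a leaf word node."""
    return Node("word", BoundingBox(left, top, right, bottom), text)


line1 = Node("line", children=[word("Hello", 10, 10, 50, 20), word("world", 55, 10, 95, 20)])
line2 = Node("line", children=[word("Second", 10, 25, 60, 35), word("line", 65, 25, 90, 35)])
paragraph = Node("paragraph", children=[line1, line2])

print(paragraph.get_text())
print(paragraph.compute_bbox())
```

Keeping the box on every node means a caller can ask for text and position at any level of the tree, which is what makes layout-aware operations such as paragraph detection possible. For the actual classes and parsers, see the package documentation.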

Documentation

For more information, read the docs at [Docstruct](https://smrt-co.github.io/docstruct/).

To install the package from PyPI, run:

pip install docstruct

Contributions

Contributions to the Docstruct package are always welcome. If you have a bug fix or a new feature, feel free to create a pull request on the GitHub repository.

License

The Docstruct package is licensed under the MIT License.
