Docstruct
Overview
Docstruct is a Python package that parses the output of optical character recognition (OCR) engines, such as Tesseract (via its hOCR output) or AWS Textract, into a tree structure. Each node in the tree represents a document, page, paragraph, line, word, or character and carries its bounding box, so the tree can be used to represent the document visually. The package also supports paragraph detection and text splitting that preserves logical units.
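To make the idea of a tree of nodes with bounding boxes concrete, here is a minimal, self-contained sketch in plain Python. It does not use Docstruct and is not its API: the names `BoundingBox`, `Node`, `make_parent`, `walk`, and `get_text` are hypothetical and chosen only for illustration, and the coordinates are made up. See the documentation linked below for the package's real classes.

```python
from dataclasses import dataclass, field
from functools import reduce
from typing import Iterator, List


@dataclass(frozen=True)
class BoundingBox:
    """Axis-aligned box; hypothetical type, not part of Docstruct."""
    left: float
    top: float
    right: float
    bottom: float

    def union(self, other: "BoundingBox") -> "BoundingBox":
        # Smallest box that contains both boxes.
        return BoundingBox(
            min(self.left, other.left),
            min(self.top, other.top),
            max(self.right, other.right),
            max(self.bottom, other.bottom),
        )


@dataclass
class Node:
    """One tree node: a document, page, paragraph, line, word, or character."""
    kind: str
    bbox: BoundingBox
    text: str = ""
    children: List["Node"] = field(default_factory=list)

    def walk(self) -> Iterator["Node"]:
        # Depth-first traversal: the node itself, then its descendants.
        yield self
        for child in self.children:
            yield from child.walk()

    def get_text(self) -> str:
        # Leaves hold text; words in a line are joined by spaces,
        # every other level is joined by newlines.
        if not self.children:
            return self.text
        separator = " " if self.kind == "line" else "\n"
        return separator.join(child.get_text() for child in self.children)


def make_parent(kind: str, children: List[Node]) -> Node:
    # A parent's bounding box is the union of its children's boxes.
    bbox = reduce(BoundingBox.union, (child.bbox for child in children))
    return Node(kind=kind, bbox=bbox, children=children)


# Illustrative data only: two lines of two words each.
line_1 = make_parent("line", [
    Node("word", BoundingBox(0, 0, 40, 10), "Hello"),
    Node("word", BoundingBox(45, 0, 90, 10), "world"),
])
line_2 = make_parent("line", [
    Node("word", BoundingBox(0, 12, 50, 22), "Second"),
    Node("word", BoundingBox(55, 12, 80, 22), "line"),
])
paragraph = make_parent("paragraph", [line_1, line_2])

print(paragraph.get_text())
print(paragraph.bbox)
```

In this sketch each parent derives its box from its children, so a paragraph's box always encloses its lines and words. Whether Docstruct computes boxes this way or takes them directly from the OCR output is not stated here.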
Documentation
For more information, see the [Docstruct documentation](https://smrt-co.github.io/docstruct/).
Installation
pip install docstruct
Contributions
Contributions to the Docstruct package are always welcome. If you have a bug fix or a new feature, feel free to create a pull request on the GitHub repository.
License
The Docstruct package is licensed under the MIT License.