amazon-textract-prettyprinter

Amazon Textract Helper tools for pretty printing

  • 0.1.10
  • PyPI
Textract-PrettyPrinter

Provides functions to format the output received from Textract into more easily consumable formats, including CSV and Markdown.

Install

> python -m pip install amazon-textract-prettyprinter

Make sure your environment is set up with AWS credentials through configuration files, environment variables, or an attached role (https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-configure.html).
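As a quick sanity check before calling Textract, the sketch below looks for two common credential sources. The helper name find_credential_hints is hypothetical and not part of this package. It only inspects the standard AWS environment variables and the default shared credentials file, so it would not detect an attached role, and it does not validate anything it finds.

```python
import os
from pathlib import Path


def find_credential_hints(env=None, home=None):
    """Return a list of places where AWS credentials appear to be configured.

    Hypothetical helper for illustration only; it does not validate credentials.
    """
    env = os.environ if env is None else env
    home = Path.home() if home is None else Path(home)
    hints = []
    # Static keys supplied through environment variables.
    if env.get("AWS_ACCESS_KEY_ID") and env.get("AWS_SECRET_ACCESS_KEY"):
        hints.append("environment variables")
    # A named profile selected through the environment.
    if env.get("AWS_PROFILE"):
        hints.append("profile " + env["AWS_PROFILE"])
    # The default shared credentials file written by `aws configure`.
    if (home / ".aws" / "credentials").is_file():
        hints.append("shared credentials file")
    return hints
```

Calling find_credential_hints() with no arguments checks the current process environment and home directory. An empty list does not prove that credentials are missing, only that neither of these sources was found.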

Samples

Get FORMS and TABLES as CSV

from textractcaller.t_call import call_textract, Textract_Features
from textractprettyprinter.t_pretty_print import Pretty_Print_Table_Format, Textract_Pretty_Print, get_string

textract_json = call_textract(input_document=input_document, features=[Textract_Features.FORMS, Textract_Features.TABLES])
print(get_string(textract_json=textract_json,
               table_format=Pretty_Print_Table_Format.csv,
               output_type=[Textract_Pretty_Print.TABLES, Textract_Pretty_Print.FORMS]))

Get string for TABLES using the get_string method

from textractcaller.t_call import call_textract, Textract_Features
from textractprettyprinter.t_pretty_print import Textract_Pretty_Print, get_string

textract_json = call_textract(input_document=input_document, features=[Textract_Features.TABLES])
get_string(textract_json=textract_json, output_type=Textract_Pretty_Print.TABLES)

Print out tables in LaTeX format

from textractcaller.t_call import call_textract, Textract_Features
from textractprettyprinter.t_pretty_print import Pretty_Print_Table_Format, get_tables_string

textract_json = call_textract(input_document=input_document, features=[Textract_Features.FORMS, Textract_Features.TABLES])
print(get_tables_string(textract_json=textract_json, table_format=Pretty_Print_Table_Format.latex))

Get linearized text from LAYOUT using get_text_from_layout_json method

Generates a dictionary of linearized text from a Textract JSON response that includes LAYOUT, and optionally writes linearized plain text files to the local file system or Amazon S3. It accepts either per-page JSON from the AnalyzeDocument API or a single combined JSON with all pages, created from the StartDocumentAnalysis output JSONs.

from textractcaller.t_call import call_textract, Textract_Features
from textractprettyprinter.t_pretty_print import get_text_from_layout_json

textract_json = call_textract(input_document=input_document, features=[Textract_Features.LAYOUT, Textract_Features.TABLES])
layout = get_text_from_layout_json(textract_json=textract_json)

# The result is a dictionary; this reads the entry under key 1 (presumably page 1)
full_text = layout[1]
print(full_text)

In addition to textract_json, the get_text_from_layout_json function accepts the following optional parameters:

  • table_format (str, optional): Format of tables within the document. Supports all python-tabulate table formats. See tabulate for supported table formats. Defaults to grid.
  • exclude_figure_text (bool, optional): If set to True, excludes text extracted from figures in the document. Defaults to False.
  • exclude_page_header (bool, optional): If set to True, excludes the page header from the linearized text. Defaults to False.
  • exclude_page_footer (bool, optional): If set to True, excludes the page footer from the linearized text. Defaults to False.
  • exclude_page_number (bool, optional): If set to True, excludes the page number from the linearized text. Defaults to False.
  • skip_table (bool, optional): If set to True, skips including the table in the linearized text. Defaults to False.
  • save_txt_path (str, optional): Path to save the output linearized text to files. Either a local file system path or Amazon S3 path can be specified in s3://bucket_name/prefix/ format. Files will be saved with <page_number>.txt naming convention.
  • generate_markdown (bool, optional): If set to True, generates markdown formatted linearized text. Defaults to False.
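To illustrate how the returned dictionary and the <page_number>.txt naming convention fit together, here is a minimal sketch that works on a hand-written stand-in for the function's result. It assumes the dictionary is keyed by page number, as the sample above suggests. The helper name write_pages and the sample text are hypothetical, and the library itself is not called here.

```python
import tempfile
from pathlib import Path


def write_pages(layout, out_dir):
    """Write each page's text to <page_number>.txt and return the joined text.

    `layout` is assumed to map page numbers to linearized text.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    pages = []
    for page_number in sorted(layout):
        text = layout[page_number]
        (out / f"{page_number}.txt").write_text(text, encoding="utf-8")
        pages.append(text)
    # Separate pages with a blank line in the combined document.
    return "\n\n".join(pages)


# Stand-in data (not real Textract output).
sample_layout = {1: "Amazing Headline!\nLorem ipsum dolor sit amet", 2: "Second page text"}

with tempfile.TemporaryDirectory() as tmp_dir:
    document_text = write_pages(sample_layout, tmp_dir)
    written = sorted(p.name for p in Path(tmp_dir).iterdir())

print(written)
print(document_text)
```

When save_txt_path is passed to get_text_from_layout_json, the library handles the writing itself; a helper like this is only useful for post-processing the returned dictionary.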

Generate the layout.csv similar to the Textract Web Console

Customers asked for the ability to generate the layout.csv format, which can be downloaded when testing documents in the AWS Web Console. The method get_layout_csv_from_trp2 generates, for each page, a list of entries with the following fields:

Page number, Layout, Text, Reading Order, Confidence score

  • Page number: starts at 1 and increments for each page
  • Layout: the BlockType plus a number indicating the sequence for this BlockType, starting at 1; for elements that belong to a LAYOUT_LIST, the string "- part of LAYOUT_LIST (index)" is appended
  • Text: the underlying text, except for LAYOUT_LIST and LAYOUT_FIGURE
  • Reading Order: an integer that increases for each LAYOUT element, starting at 0
  • Confidence score: the confidence that this is a LAYOUT element

This list can be used to generate a CSV (or another format). The sample below shows how to generate a CSV.

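# taken from the test
# generates the CSV in memory
import csv
import io
import json

from trp.trp2 import TDocument, TDocumentSchema
from textractprettyprinter import get_layout_csv_from_trp2

# placeholder: path to a Textract JSON response that includes LAYOUT
some_test_file = "<some_test_file>"

with open(some_test_file) as input_fp:
    trp2_doc: TDocument = TDocumentSchema().load(json.load(input_fp))
    layout_csv = get_layout_csv_from_trp2(trp2_doc)
    csv_output = io.StringIO()
    csv_writer = csv.writer(csv_output, delimiter=",", quotechar='"', quoting=csv.QUOTE_MINIMAL)
    # layout_csv holds one list of rows per page
    for page in layout_csv:
        csv_writer.writerows(page)
    # print the CSV text, not the StringIO object
    print(csv_output.getvalue())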

Sample output

| Page number | Layout | Text | Reading Order | BlockType | Confidence score |
| --- | --- | --- | --- | --- | --- |
| 1 | LAYOUT_SECTION_HEADER 1 | Amazing Headline!... | 0 | LAYOUT_SECTION_HEADER | 81.25 |
| 1 | LAYOUT_TEXT 1 | Lorem ipsum dolor sit amet, co... | 1 | LAYOUT_TEXT | 99.755859375 |
| 1 | LAYOUT_SECTION_HEADER 2 | Unbelievable stuff... | 2 | LAYOUT_SECTION_HEADER | 90.478515625 |
| 1 | LAYOUT_TEXT 2 | Ut ultrices felis vel mi susci... | 3 | LAYOUT_TEXT | 98.486328125 |
| 1 | LAYOUT_LIST 1 | | 4 | LAYOUT_LIST | 97.16796875 |
| 1 | LAYOUT_TEXT 3 - part of LAYOUT_LIST 1 | Priority list item 1... | 5 | LAYOUT_TEXT | 97.8515625 |
| 1 | LAYOUT_TEXT 4 - part of LAYOUT_LIST 1 | Priority list item 2... | 6 | LAYOUT_TEXT | 98.095703125 |
| 1 | LAYOUT_TEXT 5 - part of LAYOUT_LIST 1 | Another list item 3... | 7 | LAYOUT_TEXT | 98.095703125 |
| 1 | LAYOUT_TEXT 6 - part of LAYOUT_LIST 1 | And a total optional list item... | 8 | LAYOUT_TEXT | 98.73046875 |
| 1 | LAYOUT_LIST 2 | | 9 | LAYOUT_LIST | 69.53125 |
| 1 | LAYOUT_TEXT 7 - part of LAYOUT_LIST 2 | 1. But we... | 10 | LAYOUT_TEXT | 95.751953125 |
| 1 | LAYOUT_TEXT 8 - part of LAYOUT_LIST 2 | 2. can also... | 11 | LAYOUT_TEXT | 96.923828125 |
| 1 | LAYOUT_TEXT 9 - part of LAYOUT_LIST 2 | 3. do numbered... | 12 | LAYOUT_TEXT | 97.36328125 |
| 1 | LAYOUT_TEXT 10 - part of LAYOUT_LIST 2 | 4. lists... | 13 | LAYOUT_TEXT | 96.6796875 |
| 1 | LAYOUT_TEXT 11 | congue ac. Phasellus mollis co... | 14 | LAYOUT_TEXT | 96.044921875 |
| 1 | LAYOUT_TEXT 12 | Quisque a elementum diam. Null... | 15 | LAYOUT_TEXT | 96.484375 |
| 1 | LAYOUT_TEXT 13 | Table Caption 1... | 16 | LAYOUT_TEXT | 86.865234375 |
| 1 | LAYOUT_TABLE 1 | Date Description Amount 12-12-... | 17 | LAYOUT_TABLE | 96.435546875 |
| 1 | LAYOUT_TEXT 14 | Quisque dapibus varius ipsum, ... | 18 | LAYOUT_TEXT | 93.06640625 |
| 1 | LAYOUT_FIGURE 1 | | 19 | LAYOUT_FIGURE | 94.3359375 |
| 1 | LAYOUT_TEXT 15 | Figure Caption 1... | 20 | LAYOUT_TEXT | 63.18359375 |
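Rows in this shape are easy to post-process. The following sketch filters them by BlockType and returns the text in reading order. It assumes each row has the six columns shown in the sample output, in that order, and that the rows are grouped into one list per page. The function name texts_by_block_type is hypothetical, and the rows are copied by hand from the sample rather than produced by the library.

```python
def texts_by_block_type(pages, block_type):
    """Return the Text column of all rows with the given BlockType, in reading order.

    Assumed column order: page number, layout, text, reading order,
    block type, confidence score.
    """
    rows = [row for page in pages for row in page if row[4] == block_type]
    # Sort by page number first, then by reading order within the page.
    rows.sort(key=lambda row: (int(row[0]), int(row[3])))
    return [row[2] for row in rows]


# A few rows copied from the sample output (one page).
sample_pages = [[
    [1, "LAYOUT_SECTION_HEADER 1", "Amazing Headline!...", 0, "LAYOUT_SECTION_HEADER", 81.25],
    [1, "LAYOUT_TEXT 1", "Lorem ipsum dolor sit amet, co...", 1, "LAYOUT_TEXT", 99.755859375],
    [1, "LAYOUT_SECTION_HEADER 2", "Unbelievable stuff...", 2, "LAYOUT_SECTION_HEADER", 90.478515625],
]]

print(texts_by_block_type(sample_pages, "LAYOUT_SECTION_HEADER"))
# ['Amazing Headline!...', 'Unbelievable stuff...']
```

A header row, if one is present, is skipped naturally because its BlockType column never matches a real block type.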
