
amazon-textract-textractor
Textractor is a Python package created to work seamlessly with Amazon Textract, a document intelligence service offering text recognition, table extraction, form processing, and much more. Whether you are writing a one-off script or a complex distributed document processing pipeline, Textractor makes it easy to use Textract.
If you are looking for the other amazon-textract-* packages, you can find them using the links below:
Textractor is available on PyPI and can be installed with pip install amazon-textract-textractor. By default this installs the minimal version of Textractor, which is suitable for Lambda execution. The following extras can be used to add features:
pandas (pip install "amazon-textract-textractor[pandas]") installs pandas, which is used to enable DataFrame and CSV exports.
pdfium (pip install "amazon-textract-textractor[pdfium]") includes pypdfium2 and is the recommended way to enable PDF rasterization in Textractor. Note that this is not necessary to call Textract with a PDF file.
pdf (pip install "amazon-textract-textractor[pdf]") includes pdf2image and is an additional way to enable PDF rasterization in Textractor. Note that this is not necessary to call Textract with a PDF file.
torch (pip install "amazon-textract-textractor[torch]") includes sentence_transformers for better word search and matching. This will work on CPU but will be noticeably slower than approaches that do not rely on machine learning.
dev (pip install "amazon-textract-textractor[dev]") includes all the dependencies above and everything else needed to test the code.
You can pick several extras by separating the labels with commas, like this: pip install "amazon-textract-textractor[pdf,torch]"
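Since each extra pulls in a specific third-party package, one way to see which optional features are usable in the current environment is to probe for those packages. The sketch below is not part of Textractor: the helper name available_extras is hypothetical, and the module names are assumed to be the import names of the packages listed above.

```python
from importlib.util import find_spec

# Assumed import names for the package each extra installs (from the list above).
EXTRA_MODULES = {
    "pandas": "pandas",
    "pdfium": "pypdfium2",
    "pdf": "pdf2image",
    "torch": "sentence_transformers",
}


def available_extras(modules=EXTRA_MODULES):
    """Return {extra_name: bool}, True when the extra's module can be located."""
    return {extra: find_spec(module) is not None for extra, module in modules.items()}


if __name__ == "__main__":
    for extra, installed in available_extras().items():
        print(f"{extra}: {'installed' if installed else 'missing'}")
```

The check only looks for the module on the import path; it does not import it, so it stays cheap even when sentence_transformers is present.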
Generated documentation for the latest released version can be accessed here: aws-samples.github.io/amazon-textract-textractor/
While a collection of simple examples is presented here, the documentation has a much larger collection of examples, including specific case studies, that will help you get started.
These two lines are all you need to use Textract. The Textractor instance can be reused across multiple requests for both synchronous and asynchronous requests.
from textractor import Textractor
extractor = Textractor(profile_name="default")
# file_source can be an image, list of images, bytes or S3 path
document = extractor.detect_document_text(file_source="tests/fixtures/single-page-1.png")
print(document.lines)
#[Textractor Test, Document, Page (1), Key - Values, Name of package: Textractor, Date : 08/14/2022, Table 1, Cell 1, Cell 2, Cell 4, Cell 5, Cell 6, Cell 7, Cell 8, Cell 9, Cell 10, Cell 11, Cell 12, Cell 13, Cell 14, Cell 15, Selection Element, Selected Checkbox, Un-Selected Checkbox]
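The note above about reusing the Textractor instance can be sketched as a small helper that runs text detection on several files with one shared extractor. The helper name lines_per_file is hypothetical, and it assumes each entry of document.lines exposes its string through a text attribute.

```python
def lines_per_file(extractor, file_sources):
    """Run detect_document_text on every source using one shared extractor."""
    results = {}
    for source in file_sources:
        document = extractor.detect_document_text(file_source=source)
        # Assumption: each detected line exposes its string as `.text`.
        results[source] = [line.text for line in document.lines]
    return results


# Usage (calls the Textract API and needs AWS credentials):
# from textractor import Textractor
# extractor = Textractor(profile_name="default")
# print(lines_per_file(extractor, ["tests/fixtures/single-page-1.png"]))
```

Passing the extractor in as an argument, rather than creating it inside the helper, keeps a single client alive for the whole batch.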
from textractor.data.constants import TextractFeatures
document = extractor.analyze_document(
    file_source="tests/fixtures/form.png",
    features=[TextractFeatures.TABLES]
)
# Saves the table in an Excel document for further processing
document.tables[0].to_excel("output.xlsx")
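When a document contains more than one table, the same to_excel call can be applied to each entry of document.tables. This is a minimal sketch; the helper name export_tables and the table_<index>.xlsx naming scheme are illustrative choices, not something Textractor prescribes.

```python
from pathlib import Path


def export_tables(document, output_dir="."):
    """Write every table of the document to its own Excel file; return the paths."""
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for index, table in enumerate(document.tables):
        target = out / f"table_{index}.xlsx"  # hypothetical naming scheme
        table.to_excel(str(target))
        written.append(target)
    return written


# Usage, with `document` returned by analyze_document(..., features=[TextractFeatures.TABLES]):
# export_tables(document, "exports")
```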
from textractor.data.constants import TextractFeatures
document = extractor.analyze_document(
    file_source="tests/fixtures/form.png",
    features=[TextractFeatures.FORMS]
)
# Use document.get() to search for a key with fuzzy matching
document.get("email")
# [E-mail Address : johndoe@gmail.com]
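To give an intuition for why a query such as "email" can match a key such as "E-mail Address", here is a toy illustration of fuzzy key matching built on the standard library's difflib. It is not Textractor's actual matching algorithm; the function name best_fuzzy_key and the scoring method are assumptions made purely for illustration.

```python
from difflib import SequenceMatcher


def best_fuzzy_key(query, keys):
    """Return (key, score) for the key most similar to query, ignoring case."""
    def score(key):
        return SequenceMatcher(None, query.lower(), key.lower()).ratio()

    best = max(keys, key=score)  # raises ValueError if keys is empty
    return best, score(best)


if __name__ == "__main__":
    key, similarity = best_fuzzy_key("email", ["Name", "E-mail Address", "Phone"])
    print(key, round(similarity, 2))
```

A real pipeline would normally also apply a minimum similarity threshold so that unrelated keys are not returned when nothing matches well.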
document = extractor.analyze_id(file_source="tests/fixtures/fake_id.png")
print(document.identity_documents[0].get("FIRST_NAME"))
# 'MARIA'
document = extractor.analyze_expense(file_source="tests/fixtures/receipt.jpg")
print(document.expense_documents[0].summary_fields.get("TOTAL")[0].text)
# '$1810.46'
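The total comes back as text, so any arithmetic on it needs a conversion step first. The helper below is a hedged sketch, not part of Textractor: parse_amount is a hypothetical name, and it assumes US-style formatting (a leading "$", optional thousands commas, "." as the decimal separator), as in the output shown above.

```python
from decimal import Decimal


def parse_amount(text):
    """Convert a string such as '$1810.46' to Decimal, dropping '$' and commas."""
    cleaned = text.replace("$", "").replace(",", "").strip()
    return Decimal(cleaned)  # raises decimal.InvalidOperation on non-numeric input


# Usage with the expense document from the example above:
# total = parse_amount(document.expense_documents[0].summary_fields.get("TOTAL")[0].text)
```

Decimal is used instead of float so that monetary values keep their exact cents.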
If your use case was not covered here or if you are looking for asynchronous usage examples, see our collection of examples.
Textractor also comes with the textractor script, which supports calling Textract, printing the results, and overlaying them directly in the terminal.
textractor analyze-document tests/fixtures/amzn_q2.png output.json --features TABLES --overlay TABLES
See the documentation for more examples.
The package comes with tests that call the production Textract APIs. Running the tests will incur charges to your AWS account.
This library was made possible by the work of Srividhya Radhakrishna (@srividh-r).
See CONTRIBUTING.md
Textractor can be cited using:
@software{amazontextractor,
  author = {Belval, Edouard and Delteil, Thomas and Schade, Martin and Radhakrishna, Srividhya},
  title = {{Amazon Textractor}},
  url = {https://github.com/aws-samples/amazon-textract-textractor},
  version = {1.9.2},
  year = {2025}
}
Or using the CITATION.cff file.
This library is licensed under the Apache 2.0 License.
Excavator image by macrovector on Freepik