This library loads Amazon Textract API response JSONs into structured classes with helper methods, for easier post-processing.
It's designed to work in both NodeJS and browser environments, and to support projects in either JavaScript or TypeScript.
⚠️ Warning: If you're migrating from another TRP implementation such as the Textract Response Parser for Python, please note that the APIs and available features may be substantially different. Please let us know if there's a feature you're missing!
Installation
You can use TRP in your JavaScript or TypeScript NPM projects:
$ npm install amazon-textract-response-parser
const { TextractDocument, TextractIdentity } = require("amazon-textract-response-parser");
import { TextractDocument, TextractExpense } from "amazon-textract-response-parser";
...Or link directly in the browser - for example via a CDN like unpkg:
<script src="https://unpkg.com/amazon-textract-response-parser@x.y.z"></script>
<script>
var doc = new trp.TextractDocument(...);
var avg = trp.aggregate([1, 2, 3], trp.AggregationMethod.Mean);
</script>
To enable this, the distribution of this library provides multiple builds:
dist/cjs (default main), for CommonJS environments like NodeJS - including most front end applications built with tools like React and Webpack.
dist/es (default module), for ES6/ES2015/esnext capable environments.
dist/browser (default jsdelivr and unpkg), for linking directly from browser HTML with no module framework (IIFE).
This means that deep imports will depend on your build environment, but are generally discouraged anyway and may not work correctly with TypeScript. Check out the examples/ folder on GitHub for some basic starters using the different styles.
Loading data
Initialize a TextractDocument (or TextractExpense, TextractIdentity) by providing the parsed response JSON object from the corresponding Amazon Textract APIs such as GetDocumentAnalysis, AnalyzeID, or AnalyzeExpense. In most cases, providing an array of response objects is also supported (for use when a large Amazon Textract response was split/paginated).
For example, loading a response JSON from file in NodeJS:
fs.readFile("./my-analyze-document-response.json", (err, resBuffer) => {
if (err) throw err;
const doc = new TextractDocument(JSON.parse(resBuffer));
});
If you're using TypeScript, you may need to typecast your input JSON while loading it.
The ApiResponsePage input interface exposed and expected by this module is more constrained than - but functionally compatible with - the output types produced by the AWS SDK for JavaScript Textract Client.
import { ApiAnalyzeExpenseResponse } from "amazon-textract-response-parser";
import { TextractClient, AnalyzeExpenseCommand } from "@aws-sdk/client-textract";
const textract = new TextractClient({});
async function main() {
const textractResponse = await textract.send(
new AnalyzeExpenseCommand({
Document: { Bytes: await fs.readFile("...") },
})
);
const expense = new TextractExpense((textractResponse as unknown) as ApiAnalyzeExpenseResponse);
}
With your data loaded in to a TRP TextractDocument or similar, you're ready to take advantage of the higher-level TRP.js functions to navigate and analyze the result.
Generic document text navigation
In general, this library avoids directly exposing arrays in results (see the Mutation operations section below). Instead, you can use:
.n*** properties to count items
.list***() functions to return a copy of the underlying array
.iter***() functions to iterate through collections, or
.***At***() functions to fetch a specific item from a collection
For example:
console.log(`Opened doc with ${doc.nPages} pages`);
console.log(
`The first word of the first line is ${doc.pageNumber(1).lineAtIndex(0).wordAtIndex(0).text}`
);
for (const page of doc.iterPages()) {
for (const line of page.iterLines()) {
for (const word of line.iterWords()) {
console.log(word.text);
}
}
}
const linesArrsByPage = doc.listPages().map((p) => p.listLines());
These arrays are in the raw order returned by Amazon Textract, which is not necessarily a logical human reading order - especially for multi-column documents. See the Layout analysis and List text in approximate reading order sections below for extra content sorting utilities.
Queries
The results of Amazon Textract Queries are accessible at the page level under page.queries. You can get* a query by exact question text or alias, or search* them by case-insensitive substrings:
doc.listPages().forEach((page) => {
console.log(page.queries.str());
const customer = page.queries.getQueryByAlias("customer_name")?.topResult?.text;
const shippingAddrCandidates =
page.queries.getQueryByAlias("shipping_addr")?.listResultsByConfidence() || [];
const shippingAddrTopConf = shippingAddrCandidates[0].confidence;
const addrQueries = page.queries.searchQueriesByQuestion("address");
});
Forms (Key-Value pairs)
As well as looping through the form data key-value pairs in the document, you can query fields by key:
console.log(doc.form.nFields);
const fields = doc.form.listFields();
const addr = doc.form.getFieldByKey("Address").value?.text;
const addresses = doc.form.searchFieldsByKey("address");
addresses.forEach((addrField) => { console.log(addrField.key.text); });
Note that the Field.confidence, FieldKey.confidence and FieldValue.confidence scores reflect confidence of the key-value structure detection model. For aggregated OCR confidence of their actual text, use .getOcrConfidence() instead.
You can also search form keys at the individual page level, or look up the page number for detected fields:
const fieldByDoc = doc.form.getFieldByKey("Address");
console.log(`Detected Address on page ${fieldByDoc.parentPage.pageNumber}`);
const page = doc.pageNumber(1);
const fieldByPage = page.form.getFieldByKey("Address");
field.isCheckbox is true for fields whose value contain exactly one SelectionElement object: meaning they're a (key=label)->(value=checkbox/radio) pair. For these fields, you can directly use field.selectionStatus or field.isSelected to look up the value's status. For other (non-checkbox) fields, they'll return null.
Tables
This library's table navigation tools address merged cells by default, for convenience.
console.log(page.nTables);
const table = page.tableAtIndex(0);
const headerStrs = table.cellsAt(1, null)?.map(cell => cell.text);
const firstColCells = table.cellsAt(null, 1);
const targetCell = table.cellAt(2, 4);
for (const row of table.iterRows()) {
for (const cell of row.iterCells()) {
console.log(cell.text);
}
}
Further configuration arguments can be used to change the treatment of merged cells if needed:
for (const row of table.iterRows({repeatMultiRowCells: true})) {}
const firstColCellFragments = table.cellsAt(null, 1, {ignoreMerged: true});
The Table.confidence, Row.getConfidence() and Cell.confidence scores reflect confidence of the table structure detection model. For aggregated OCR confidence of the text contained inside, use .getOcrConfidence() instead.
Use Table.tableType and Cell.hasEntityTypes() to explore the more advanced entity types extracted by Amazon Textract: For example column headers, title cells, footer cells, and summary cells:
import { ApiTableCellEntityType, ApiTableEntityType } from "amazon-textract-response-parser";
const isSemiStruct = table.tableType === ApiTableEntityType.SemiStructuredTable;
const colHeaders = table.rowAt(1).listCells()
.filter((c) => c.hasEntityTypes(ApiTableCellEntityType.ColumnHeader));
For overall table-level title and footer captions, see table.listTitles() and table.listFooters(), etc.
Layout analysis
Layout analysis in Amazon Textract detects higher-level semantic components than the core text Lines & Words - like paragraphs and headings. If you enabled this analysis, you can access the results through the page.layout collection:
page.layout.listItems().forEach((layItem) => {
console.log(layItem.blockType);
const textLines = layItem.listTextLines();
const children = layItem.listContent();
console.log(layItem.text + "\n");
});
for (const layItem of page.layout.listItems({
skipBlockTypes: [
ApiBlockType.LayoutHeader, ApiBlockType.LayoutFooter, ApiBlockType.LayoutPageNumber
],
})) {
console.log(layItem.text);
}
If Forms and/or Tables analyses were also enabled, you'll be able to traverse from the relevant Layout object types to these more detailed representations. However, because these are separate analyses the correspondence may not be 1-to-1 and TRP is having to do some reconciliation under the hood:
import { ApiBlockType, LayoutKeyValue, LayoutTable } from "amazon-textract-response-parser";
page.layout.listItems().forEach((layItem) => {
if (layItem.blockType === ApiBlockType.LayoutKeyValue) {
const fields = (layItem as LayoutKeyValue).listFields();
fields.forEach((field) => console.log(field.key.text));
} else if (layItem.blockType === ApiBlockType.LayoutTable) {
const tables = (layItem as LayoutTable).listTables();
tables.forEach((table) => console.log(table.nCells));
}
});
List text in approximate reading order (with or without Layout)
Particularly for multi-column documents, the default output sequence for Amazon Textract LINE/WORD OCR results will likely not be the overall reading order you'd like. For best performance, enable and use the Layout analysis because layout items are returned in implied reading order as estimated by the AI service.
Alternatively, TRP.js provides a client-side heuristic algorithm that can attempt to sort results without Layout. There are even some configuration parameters exposed to help you tune the results for your particular domain, and test harnesses in the tests/unit/corpus folder to help you experiment via npm run test:unit:
import { ReadingOrderLayoutMode } from "amazon-textract-response-parser";
let textInReadingOrder: string = page.getTextInReadingOrder();
let pseudoParas = page.getLineClustersInReadingOrder();
let layText = page.getTextInReadingOrder({ useLayout: ReadingOrderLayoutMode.RequireLayout });
let layParas = page.getLineClustersInReadingOrder({
colHOverlapThresh = 0.75,
paraVDistTol = 0.8,
useLayout: ReadingOrderLayoutMode.IgnoreLayout,
});
for (const pseudoParagraph of pseudoParas) {
for (const line of pseudoParagraph) {
console.log(line.text);
}
console.log();
}
When configured to use Layout analysis results, these functions should be equivalent to just looping through your page.layout.iterItems() to get the text from each one in order.
Render documents to semantic markup/markdown
If you'd like to use AI/ML models to further post-process your Amazon Textract results, you have a choice between those that take text-only inputs - and "multi-modal" models that can also ingest structural information (see for example this Amazon Comprehend feature and this Amazon SageMaker sample). While multi-modal models may work best on complex structured documents, the pace of research on text-only Large Language Models has historically been faster (perhaps because plain text data is easier to come by and work with).
Semantic markup like HTML provides somewhat of a middle ground where we can try to preserve the layout/form/table/etc structure Amazon Textract extracted, but still provide plain text. This may be particularly useful for working with Generative Large Language Models (GenAI/LLMs) like those on Amazon Bedrock.
console.log(page.listTables[0].html());
const docHtml = doc.html();
fs.writeFile("./my-doc.html", docHtml, (err) => {});
Some caveats to be aware of:
- Top-level
Page.html() and TextractDocument.html() currently depend on Layout analysis being enabled, because the Layout results are used to sequence all the elements together.
- Only HTML is supported currently, but we're keen to add
.markdown() if there's interest
You can also filter out types of content you don't want to include in your HTML.
const docHtml = doc.html({
skipBlockTypes: [
ApiBlockType.LayoutHeader, ApiBlockType.LayoutFooter, ApiBlockType.LayoutPageNumber
],
});
const docHtmlNoCellsOrSelectors = doc.html({
skipBlockTypes: [ApiBlockType.Cell, ApiBlockType.SelectionElement],
});
const docTablesHtml = doc.html({
includeBlockTypes: [
ApiBlockType.Page,
ApiBlockType.LayoutTable,
ApiBlockType.Table,
ApiBlockType.Cell,
ApiBlockType.SelectionElement,
ApiBlockType.Word,
],
});
If you have feedback about these features, please let us know in the GitHub issues to help prioritise!
Segment headers and footers from main content
This is another task for which you might find Textract Layout analysis useful - by looping through layout items and filtering out those of type LayoutHeader, LayoutFooter and PageNumber.
However, TRP.js also provides a heuristic function you can try instead:
const segmented = page.getLinesByLayoutArea(
true
);
console.log("---- HEADER:")
console.log(segmented.header.map((l) => l.text).join("\n"));
console.log("\n---- CONTENT:")
console.log(segmented.content.map((l) => l.text).join("\n"));
console.log("\n---- FOOTER:")
console.log(segmented.footer.map((l) => l.text).join("\n"));
Note: Unlike the *inReadingOrder APIs, this utility has not yet been updated to use Textract Layout analysis when it's available. That behavior might change in future.
Calculate average skew of page text
Calculating the overall skew of a page can be useful for validation checks: For example to detect and reject a strongly skewed image which might degrade the accuracy of tables, forms, or other downstream analyses.
const skew = page.getModalWordOrientationDegrees();
This method aggregates the skew to find the most common angle across all content on the page.
Signatures
If you enabled signature detection in Amazon Textract, you can check for signatures at the page level:
doc.listPages()
.forEach((page, ix) => { console.log(`${page.nSignatures} signatures on page ${ix+1}`); });
const bbox = doc.pageNumber(1).listSignatures()[0].geometry.boundingBox;
Expense (invoice and receipt) objects
Since the format of responses for Amazon Textract's Expense results is very different from the general document analysis APIs, you can use the separate TextractExpense class in this library to process these.
const expense = new TextractExpense(textractResponse);
console.log(`Found ${expense.nDocs} expense docs in file`);
const expenseDoc = [...expense.iterDocs()][0];
for (const group of expenseDoc.iterLineItemGroups()) {
for (const item of group.iterLineItems()) {
console.log(`Found line item with ${item.nFields} fields`);
for (const field of item.iterFields()) {
...
}
}
}
const summaryFieldsArrByDoc = expense.listDocs().map((doc) => doc.listSummaryFields());
const linesArrsByPage = doc.listPages().map((p) => p.listLines())
const vendorNameFields = expenseDoc.searchSummaryFieldsByType("VENDOR_NAME");
console.log(`Found ${vendorNameFields.length} vendor name fields in doc summary`);
console.log(vendorNameFields[0].fieldType.text);
console.log(vendorNameFields[0].value.text);
Identity document objects
Similarly to expenses mentioned above, Amazon Textract offers specific APIs for identity document analysis. You can use the separate TextractIdentity class in this library to process these.
import { ApiAnalyzeIdResponse, TextractIdentity } from "amazon-textract-response-parser";
import { TextractClient, AnalyzeIDCommand } from "@aws-sdk/client-textract";
const textract = new TextractClient({});
async function main() {
const textractResponse = await textract.send(
new AnalyzeIDCommand({
Document: { Bytes: await fs.readFile("...") },
})
);
const identity = new TextractIdentity((textractResponse as unknown) as ApiAnalyzeIdResponse);
}
The library implements some enumerations of known values (for field types, ID types, and so on) to make processing AnalyzeID responses a little simpler:
import { IdDocumentType, IdFieldType } from "amazon-textract-response-parser";
const idDoc = identity.getDocAtIndex(0);
if (idDoc.idType === IdDocumentType.Passport) {
const passNumField = idDoc.getFieldByType(IdFieldType.DocumentNumber);
console.log(
`Passport number ${passNumField.value}, confidence ${passNumField.valueConfidence}%`
);
} else if (idDoc.idType === IdDocumentType.DrivingLicense) {
for (const field of idDoc.iterFields()) {
console.log(`${field.fieldTypeRaw}: ${field.valueRaw}`);
}
} else {
console.log(idDoc.str());
}
Mutation operations
Easier analysis and querying of Textract results is useful, but what if you want to augment or edit your Textract JSONs with JS/TS Textract Response Parser?
In general:
- Where the library classes (
TextractDocument, Page, Word, etc) offer mutation operations, these should modify the source API JSON object in-place and ensure self-consistency.
- For library classes that are backed by a specific object in the source API JSON, you can access it via the
.dict property (word.dict, table.dict, etc) but then you're responsible for updating any required references in other objects if making changes there.
- Any individual-block-level changes you make to the underlying API JSON should be dynamically reflected in the parsed TRP objects (e.g. overriding word text, coordinates, etc)... But changes that affect inter-block relationships are more likely to cause staleness issues.
In particular for array properties, you'll note that TRP generally exposes getters and iterators (such as table.nRows, table.iterRows(), table.listRows(), table.cellsAt()) rather than direct access to lists - to avoid implying that arbitrary array mutations (such as table.rows.pop()) are properly supported.
Other features and examples
For more examples on how to use the library, you can refer to the (basic) examples and (more complete) test folders on GitHub, and the source code itself. If you have suggestions for additional features that would be useful, please open a GitHub issue!
Development
The integration tests for this library validate the end-to-end toolchain for calling Amazon Textract and parsing the result, so note that to run the full npm run test command:
- Your environment will need to be configured with a login to AWS (e.g. via the AWS CLI)
- Billable API requests may be made
You can alternatively run just the local/unit tests via npm run test:unit.