What is @aws-sdk/client-textract?
@aws-sdk/client-textract is the AWS SDK for JavaScript (v3) client for Amazon Textract, a service that automatically extracts text and data from scanned documents. The package provides commands to detect text, analyze documents, and extract structured data from forms and tables.
What are @aws-sdk/client-textract's main functionalities?
Detect Document Text
This feature detects text in a document. The code sample uses DetectDocumentTextCommand to extract text from a document supplied as raw bytes.
const { TextractClient, DetectDocumentTextCommand } = require('@aws-sdk/client-textract');

const client = new TextractClient({ region: 'us-west-2' });

const params = {
  Document: {
    Bytes: Buffer.from('...') // Replace with your document bytes
  }
};

const run = async () => {
  try {
    // Synchronous call: the detected text is returned in the response
    const data = await client.send(new DetectDocumentTextCommand(params));
    console.log(data);
  } catch (err) {
    console.error(err);
  }
};

run();
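The sample logs the whole response. As a minimal sketch of what to do with it, the helper below pulls out the recognized lines, assuming the response carries a Blocks array whose entries have BlockType and Text fields. The function name extractLines and the sample data are hypothetical and exist only for illustration; the sample is hand-written, not real Textract output.

```javascript
// Collect the text of every LINE block, in the order the blocks appear.
// `response` is assumed to have the shape of a DetectDocumentText result.
function extractLines(response) {
  const blocks = response.Blocks || [];
  return blocks
    .filter((block) => block.BlockType === 'LINE')
    .map((block) => block.Text);
}

// Hand-written stand-in for a real response (illustrative only).
const sampleResponse = {
  Blocks: [
    { BlockType: 'PAGE', Id: 'p1' },
    { BlockType: 'LINE', Id: 'l1', Text: 'Hello world' },
    { BlockType: 'WORD', Id: 'w1', Text: 'Hello' },
    { BlockType: 'WORD', Id: 'w2', Text: 'world' },
    { BlockType: 'LINE', Id: 'l2', Text: 'Second line' }
  ]
};

console.log(extractLines(sampleResponse).join('\n'));
```

WORD blocks repeat the content of the LINE blocks, so filtering on one block type avoids duplicated text.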
Analyze Document
This feature analyzes a document for tables and forms. The code sample uses AnalyzeDocumentCommand to extract structured data from a document supplied as raw bytes; the FeatureTypes parameter selects which kinds of analysis to run.
const { TextractClient, AnalyzeDocumentCommand } = require('@aws-sdk/client-textract');

const client = new TextractClient({ region: 'us-west-2' });

const params = {
  Document: {
    Bytes: Buffer.from('...') // Replace with your document bytes
  },
  // Request both table and form (key-value) analysis
  FeatureTypes: ['TABLES', 'FORMS']
};

const run = async () => {
  try {
    const data = await client.send(new AnalyzeDocumentCommand(params));
    console.log(data);
  } catch (err) {
    console.error(err);
  }
};

run();
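Form data comes back as linked blocks rather than ready-made pairs. The sketch below shows one way to turn them into a plain object, assuming the documented block layout: KEY_VALUE_SET blocks whose EntityTypes contains 'KEY' or 'VALUE', a VALUE relationship from key to value, and CHILD relationships pointing at WORD blocks. The helper names childText and extractKeyValues are hypothetical, and the sample data is hand-written for illustration.

```javascript
// Join the Text of the WORD blocks that a block references through CHILD relationships.
function childText(block, blocksById) {
  return (block.Relationships || [])
    .filter((rel) => rel.Type === 'CHILD')
    .flatMap((rel) => rel.Ids)
    .map((id) => blocksById.get(id))
    .filter((child) => child && child.BlockType === 'WORD')
    .map((child) => child.Text)
    .join(' ');
}

// Build a { keyText: valueText } object from an AnalyzeDocument-shaped response.
function extractKeyValues(response) {
  const blocks = response.Blocks || [];
  const blocksById = new Map(blocks.map((block) => [block.Id, block]));
  const pairs = {};
  for (const block of blocks) {
    const isKey =
      block.BlockType === 'KEY_VALUE_SET' &&
      (block.EntityTypes || []).includes('KEY');
    if (!isKey) continue;
    // Follow the VALUE relationship from the key block to its value block(s).
    const valueText = (block.Relationships || [])
      .filter((rel) => rel.Type === 'VALUE')
      .flatMap((rel) => rel.Ids)
      .map((id) => blocksById.get(id))
      .filter(Boolean)
      .map((valueBlock) => childText(valueBlock, blocksById))
      .join(' ');
    pairs[childText(block, blocksById)] = valueText;
  }
  return pairs;
}

// Hand-written stand-in for a real response (illustrative only).
const sampleAnalysis = {
  Blocks: [
    { Id: 'k1', BlockType: 'KEY_VALUE_SET', EntityTypes: ['KEY'],
      Relationships: [{ Type: 'VALUE', Ids: ['v1'] }, { Type: 'CHILD', Ids: ['w1'] }] },
    { Id: 'v1', BlockType: 'KEY_VALUE_SET', EntityTypes: ['VALUE'],
      Relationships: [{ Type: 'CHILD', Ids: ['w2', 'w3'] }] },
    { Id: 'w1', BlockType: 'WORD', Text: 'Name:' },
    { Id: 'w2', BlockType: 'WORD', Text: 'Jane' },
    { Id: 'w3', BlockType: 'WORD', Text: 'Doe' }
  ]
};

console.log(extractKeyValues(sampleAnalysis));
```

This sketch ignores checkboxes (SELECTION_ELEMENT blocks) and repeated keys, both of which a production version would need to handle.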
Start Document Text Detection
This feature starts an asynchronous job to detect text in a document stored in an S3 bucket. The code sample uses StartDocumentTextDetectionCommand to initiate the job. The response contains a JobId rather than the detected text; the results are retrieved separately with GetDocumentTextDetectionCommand once the job has finished.
const { TextractClient, StartDocumentTextDetectionCommand } = require('@aws-sdk/client-textract');

const client = new TextractClient({ region: 'us-west-2' });

const params = {
  DocumentLocation: {
    S3Object: {
      Bucket: 'your-bucket-name',
      Name: 'your-document-name'
    }
  }
};

const run = async () => {
  try {
    // The response includes a JobId used to fetch the results later
    const data = await client.send(new StartDocumentTextDetectionCommand(params));
    console.log(data);
  } catch (err) {
    console.error(err);
  }
};

run();
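Starting the job is only half of the workflow. The sketch below illustrates one way to wait for completion and gather paginated results, assuming responses that expose JobStatus, Blocks, and NextToken as GetDocumentTextDetection does. To keep it runnable without AWS access, the call is injected: getPage, collectJobBlocks, makeFakeGetPage, and the delay and attempt limits are all hypothetical names and values chosen for illustration.

```javascript
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Poll until the job leaves IN_PROGRESS, then follow NextToken to collect all blocks.
// `getPage` is an injected function: ({ JobId, NextToken }) => Promise<response>.
// With the real client it could be:
//   (input) => client.send(new GetDocumentTextDetectionCommand(input))
async function collectJobBlocks(getPage, jobId, { delayMs = 5000, maxAttempts = 60 } = {}) {
  let page = await getPage({ JobId: jobId });
  for (let attempt = 1; page.JobStatus === 'IN_PROGRESS'; attempt++) {
    if (attempt >= maxAttempts) {
      throw new Error(`Job ${jobId} still in progress after ${maxAttempts} attempts`);
    }
    await sleep(delayMs);
    page = await getPage({ JobId: jobId });
  }
  // This sketch treats anything other than SUCCEEDED (e.g. FAILED) as an error.
  if (page.JobStatus !== 'SUCCEEDED') {
    throw new Error(`Job ${jobId} ended with status ${page.JobStatus}`);
  }
  const blocks = [...(page.Blocks || [])];
  while (page.NextToken) {
    page = await getPage({ JobId: jobId, NextToken: page.NextToken });
    blocks.push(...(page.Blocks || []));
  }
  return blocks;
}

// Stand-in for the real call so the sketch runs locally (illustrative data only).
function makeFakeGetPage() {
  const responses = [
    { JobStatus: 'IN_PROGRESS' },
    { JobStatus: 'SUCCEEDED', Blocks: [{ BlockType: 'LINE', Text: 'page one' }], NextToken: 't1' },
    { JobStatus: 'SUCCEEDED', Blocks: [{ BlockType: 'LINE', Text: 'page two' }] }
  ];
  let call = 0;
  return async () => responses[call++];
}

collectJobBlocks(makeFakeGetPage(), 'example-job-id', { delayMs: 10 })
  .then((blocks) => console.log(blocks.map((block) => block.Text)));
```

Injecting the fetch function keeps the polling logic independent of the SDK, so it can be exercised with a stub before being wired to a real client.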
Other packages similar to @aws-sdk/client-textract
tesseract.js
Tesseract.js is a JavaScript library that provides optical character recognition (OCR). It can extract text from images and scanned documents. Unlike @aws-sdk/client-textract, which sends documents to a cloud-based service, Tesseract.js performs recognition locally in the browser or on the server, making it a good choice for offline applications.
ocr-space-api-wrapper
ocr-space-api-wrapper is a Node.js wrapper for the OCR.space API, which provides OCR capabilities for extracting text from images and PDFs. Like @aws-sdk/client-textract, it is a client for a cloud-based service, but the underlying service offers a different feature set and pricing model.
pdf2json
pdf2json is a Node.js library that extracts text and metadata from PDF files. It does not offer the same level of structured data extraction as @aws-sdk/client-textract, but it is useful for basic text extraction from PDFs.