@phiresky/ocr-pdf-via-document-ai

Version 0.0.1 (npm)

Version published: 2 years ago

Maintainers: 1

Created: 2 years ago

ocr-pdf-via-document-ai

Takes a set of JPG files, runs OCR on them via Google Cloud Document AI, and outputs:

A plain text file per input image
The raw docai JSON output
A HOCR file per input image (can be opened as a standalone HTML file via hocrjs)
A PDF file with an invisible text layer to make it searchable
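
As a rough illustration of how the plain text output can be derived from the raw JSON, the sketch below pulls the text of each line on a page out of a Document AI response. It assumes the documented response shape, where `document.text` holds the full text and each line's `layout.textAnchor.textSegments` gives start and end offsets into it (serialized as strings in JSON, with `startIndex` omitted when it is 0). The function names `anchorText` and `pageLines` are hypothetical and not taken from this package, and the offsets are treated as JavaScript string offsets, which may be off for text containing characters outside the Basic Multilingual Plane.

```typescript
// Minimal subset of the Document AI JSON shape used by this sketch.
interface TextSegment {
  startIndex?: string | number; // omitted in JSON when the offset is 0
  endIndex?: string | number;
}
interface Layout {
  textAnchor?: { textSegments?: TextSegment[] };
}
interface Page {
  lines?: { layout?: Layout }[];
}
interface DocaiDocument {
  text?: string;
  pages?: Page[];
}

// Resolve a layout's text anchor to the substring(s) of document.text it points at.
function anchorText(doc: DocaiDocument, layout?: Layout): string {
  const text = doc.text ?? "";
  const segments = layout?.textAnchor?.textSegments ?? [];
  return segments
    .map((s) => text.slice(Number(s.startIndex ?? 0), Number(s.endIndex ?? 0)))
    .join("");
}

// Return the text of every detected line on one page, without trailing newlines.
function pageLines(doc: DocaiDocument, pageIndex: number): string[] {
  const page = doc.pages?.[pageIndex];
  return (page?.lines ?? []).map((line) =>
    anchorText(doc, line.layout).replace(/\n$/, "")
  );
}
```

Joining the result of `pageLines` with newlines gives one plain text file per input image, assuming each image maps to a single page.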

Example: [images omitted: input image and the corresponding debug output]

Why a cloud service?

Tesseract is the best open-source OCR engine, but sadly the Google service performs much better and offers the following extra features:

Supports any number of languages mixed together (also outputs per-line language confidence scores)
Detects and corrects any page orientation
High recognition accuracy with different fonts and even handwriting
In theory, offers advanced capabilities such as table extraction and form extraction

Cost

At the time of writing, the price is $1.50 per 1,000 pages.
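
To get a feel for what a batch would cost, the quoted rate can be turned into a small estimate. This is only a sketch: it assumes strictly linear per-page pricing at the rate quoted above and ignores any free tier or volume discounts, so check the current pricing before relying on it. The name `estimateCostUsd` is hypothetical.

```typescript
// Rate quoted in this README; actual pricing may have changed.
const USD_PER_1000_PAGES = 1.5;

// Estimate the OCR cost in USD for a given number of pages (one JPG = one page).
function estimateCostUsd(
  pageCount: number,
  usdPer1000: number = USD_PER_1000_PAGES
): number {
  if (!Number.isInteger(pageCount) || pageCount < 0) {
    throw new RangeError("pageCount must be a non-negative integer");
  }
  return (pageCount / 1000) * usdPer1000;
}
```

For example, `estimateCostUsd(2000)` evaluates to 3, i.e. about three dollars for a 2,000-page scan.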

Installation and Running

You'll need to set up the following env variables:

GOOGLE_APPLICATION_CREDENTIALS: (only if you don't have default credentials) path to a JSON key file that grants access to Document AI
API_ENDPOINT: e.g. eu-documentai.googleapis.com
PROCESSOR_NAME: something like projects/00000000000/locations/eu/processors/fffffffffffff
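
A quick sanity check of these variables before making any API calls can save a confusing error later. The sketch below is not part of this package; `loadConfig` and `DocaiConfig` are hypothetical names. It assumes the processor name follows the `projects/.../locations/.../processors/...` pattern shown above and that a regional endpoint starts with the location followed by a dash, as in the `eu-documentai.googleapis.com` example. Because that second assumption may not hold for every endpoint, a mismatch is only reported as a flag rather than treated as an error.

```typescript
interface DocaiConfig {
  apiEndpoint: string;
  processorName: string;
  location: string; // e.g. "eu", parsed from PROCESSOR_NAME
  endpointMatchesLocation: boolean; // heuristic, see lead-in
}

// Pattern of the PROCESSOR_NAME example in this README.
const PROCESSOR_RE =
  /^projects\/([^/]+)\/locations\/([^/]+)\/processors\/([^/]+)$/;

// Validate the environment and extract the location from the processor name.
function loadConfig(env: Record<string, string | undefined>): DocaiConfig {
  const apiEndpoint = env.API_ENDPOINT;
  const processorName = env.PROCESSOR_NAME;
  if (!apiEndpoint) throw new Error("API_ENDPOINT is not set");
  if (!processorName) throw new Error("PROCESSOR_NAME is not set");

  const match = PROCESSOR_RE.exec(processorName);
  if (!match) {
    throw new Error(
      "PROCESSOR_NAME should look like projects/<id>/locations/<loc>/processors/<id>"
    );
  }
  const location = match[2] ?? "";
  return {
    apiEndpoint,
    processorName,
    location,
    // Assumption: regional endpoints are named "<location>-documentai.googleapis.com".
    endpointMatchesLocation: apiEndpoint.startsWith(`${location}-`),
  };
}
```

In a Node script this would be called as `loadConfig(process.env)`; GOOGLE_APPLICATION_CREDENTIALS is left to the Google client library, which reads it on its own.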

Then run:

yarn install

yarn run ocr --writePdf out.pdf --writeTxt --input input_dir/*.jpg

The raw Document AI output is always written, so if you run the command again it will only make API calls for new files.
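
The skip-if-cached behaviour described above can be sketched as a simple filter over the input files. This is an illustration, not the package's actual code: the cache file naming (`<image>.docai.json` next to the input) and the names `cachePathFor` and `filesNeedingOcr` are assumptions made for the example.

```typescript
// Hypothetical naming scheme for the cached raw JSON of one input image.
function cachePathFor(imagePath: string): string {
  return `${imagePath}.docai.json`;
}

// Keep only the inputs that have no cached raw output yet.
// `exists` is injected so the logic can be tested without touching the disk.
function filesNeedingOcr(
  inputs: string[],
  exists: (path: string) => boolean
): string[] {
  return inputs.filter((imagePath) => !exists(cachePathFor(imagePath)));
}
```

In a real run, `exists` would be something like `existsSync` from `node:fs`, so only images without a cached response are sent to the API and re-running after adding a few pages stays cheap.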

TODO

Package as npm package, fix startup time
Ensure PDF output has all the text in the "correct" order when run through pdftotext (also compare with hocr-pdf)

Package last updated on 06 Mar 2023
