@phiresky/ocr-pdf-via-document-ai

Version 0.0.4 (latest) on npm


ocr-pdf-via-document-ai

Takes a set of JPG files, runs OCR on them via Google Cloud Document AI, and outputs:

  • A plain text file per input image
  • The raw Document AI JSON output
  • An hOCR file per input image (can be opened as a standalone HTML file via hocrjs)
  • A PDF file with an invisible text layer to make it searchable
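
To illustrate the hOCR output, here is a minimal sketch of how one recognized word could be turned into an hOCR `ocrx_word` element. It assumes the layout box arrives as normalized vertices (x and y between 0 and 1), which is how Document AI reports bounding polygons. The names `NormalizedVertex`, `escapeHtml` and `toHocrWord` are hypothetical and are not taken from this package.

```typescript
// Hypothetical shape for illustration: a corner of a bounding polygon with
// coordinates normalized to the range [0, 1].
interface NormalizedVertex {
  x: number;
  y: number;
}

// Escape characters that would break the single-quoted hOCR markup.
function escapeHtml(s: string): string {
  return s
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/'/g, "&#39;");
}

// Build one hOCR word element. hOCR expects "bbox x0 y0 x1 y1" in pixels,
// so the normalized vertices are scaled by the page size first.
// Expects at least one vertex.
function toHocrWord(
  text: string,
  vertices: NormalizedVertex[],
  pageWidth: number,
  pageHeight: number,
  confidence: number,
): string {
  const xs = vertices.map((v) => v.x * pageWidth);
  const ys = vertices.map((v) => v.y * pageHeight);
  const bbox = [
    Math.floor(Math.min(...xs)),
    Math.floor(Math.min(...ys)),
    Math.ceil(Math.max(...xs)),
    Math.ceil(Math.max(...ys)),
  ].join(" ");
  // x_wconf is a percentage; the input confidence is assumed to be in [0, 1].
  const conf = Math.round(confidence * 100);
  return `<span class='ocrx_word' title='bbox ${bbox}; x_wconf ${conf}'>${escapeHtml(text)}</span>`;
}
```

A real hOCR file additionally nests words inside `ocr_line` and `ocr_page` elements; this sketch covers the word level only.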

Example: input image next to the debug output (images not reproduced here).

Why a cloud service?

Tesseract is arguably the best open-source OCR engine, but sadly the Google service performs much better and has the following extra features:

  • Supports any number of languages mixed together (also outputs per-line language confidence scores)
  • Detects and corrects any page orientation
  • High recognition accuracy with different fonts and even handwriting
  • In theory, advanced capabilities such as table and form extraction

Cost

Currently the price is $1.50 per 1,000 pages (see the Google Cloud Document AI pricing page).
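
As a rough sketch of what that means for a batch, the helper below estimates the bill from a page count. It assumes one page per input image and uses the price quoted above, which may have changed since; `estimateCostUsd` is a hypothetical name, not part of this package.

```typescript
// Price quoted in the text above; check the current pricing before relying on it.
const USD_PER_1000_PAGES = 1.5;

// Estimate the OCR cost in US dollars for a number of pages
// (assumption: each input JPG counts as one page).
function estimateCostUsd(pageCount: number): number {
  if (!Number.isInteger(pageCount) || pageCount < 0) {
    throw new RangeError("pageCount must be a non-negative integer");
  }
  return (pageCount / 1000) * USD_PER_1000_PAGES;
}
```

For example, `estimateCostUsd(2000)` evaluates to 3, i.e. about three dollars for a 2,000-page scan.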

Installation and Running

You'll need to set up the following env variables:

  • GOOGLE_APPLICATION_CREDENTIALS: path to a JSON file giving access to Document AI (only needed if you don't have default credentials)
  • API_ENDPOINT: e.g. eu-documentai.googleapis.com
  • PROCESSOR_NAME: something like projects/00000000000/locations/eu/processors/fffffffffffff
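
The variables above could be checked before any API call is made. The following is a minimal sketch, not the package's actual startup code: `OcrConfig` and `loadConfig` are hypothetical names, and the endpoint check rests on the assumption that regional endpoints follow the pattern `<location>-documentai.googleapis.com`, as in the `eu` example.

```typescript
interface OcrConfig {
  apiEndpoint: string;
  processorName: string;
  location: string;
  // True when the endpoint host starts with "<location>-" (assumed convention).
  endpointMatchesLocation: boolean;
}

type Env = Record<string, string | undefined>;

// Read and validate the two required variables from an environment map.
// GOOGLE_APPLICATION_CREDENTIALS is optional and read by Google's own
// client libraries, so it is not handled here.
function loadConfig(env: Env): OcrConfig {
  const apiEndpoint = env["API_ENDPOINT"];
  const processorName = env["PROCESSOR_NAME"];
  if (!apiEndpoint) throw new Error("API_ENDPOINT is not set");
  if (!processorName) throw new Error("PROCESSOR_NAME is not set");

  // Expected form: projects/<project>/locations/<location>/processors/<id>
  const match = /^projects\/([^/]+)\/locations\/([^/]+)\/processors\/([^/]+)$/.exec(
    processorName,
  );
  if (!match) {
    throw new Error(
      "PROCESSOR_NAME must look like projects/<project>/locations/<location>/processors/<id>",
    );
  }
  const location = match[2];
  return {
    apiEndpoint,
    processorName,
    location,
    endpointMatchesLocation: apiEndpoint.startsWith(`${location}-`),
  };
}
```

In a Node.js script this would be called as `loadConfig(process.env)`. A mismatch between endpoint and location is only reported, not rejected, because the naming convention is an assumption.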

Then run:

yarn install
yarn run ocr --writePdf out.pdf --writeTxt --input input_dir/*.jpg

The raw Document AI output is always written, so if you run it again it will only make API calls for new files.
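
The skip-on-rerun behavior can be pictured as a simple filter over the input files. This is a sketch under assumptions: the cache file name (`<image>.docai.json`) and the helpers `cachePathFor` and `filesNeedingOcr` are made up for illustration, and the tool's real naming scheme may differ.

```typescript
// Hypothetical naming scheme: the saved API response sits next to the image.
function cachePathFor(imagePath: string): string {
  return `${imagePath}.docai.json`;
}

// Return only the images that have no saved response yet. The existence
// check is passed in so the logic stays independent of the file system.
function filesNeedingOcr(
  imagePaths: string[],
  exists: (path: string) => boolean,
): string[] {
  return imagePaths.filter((p) => !exists(cachePathFor(p)));
}
```

In Node.js, `fs.existsSync` could be passed as the `exists` argument, so that only the returned files are sent to the API.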

TODO

  • Package as npm package, fix startup time
  • Ensure PDF output has all the text in the "correct" order when run through pdftotext (also compare with hocr-pdf)

Package last updated on 06 Mar 2023
