
node-ts-ocr
A simple wrapper around command-line utils to assist in PDF / Image OCR (Optical Character Recognition) processing using Tesseract.
npm install node-ts-ocr --save
After installing node-ts-ocr, the following binaries need to be installed on your system and available on your PATH.
Many PDFs already have plain text embedded in them, either because they were born-digital (i.e. created from a word processing document) or because OCR was already performed on them. If we can extract the text using this utility, we do not need to perform image conversion and subsequent OCR.
OSX
pdftotext & pdfinfo are included as part of the xpdf utilities library.
brew install xpdf
Ubuntu
pdftotext & pdfinfo are included in the poppler-utils library.
sudo apt-get install poppler-utils
CLI Example
Attempt to extract the text from a PDF:
pdftotext /path/to/document.pdf output.txt
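The same check can be driven from Node. The following is a minimal sketch, not the library's implementation: `pdfToText` and `hasUsableText` are hypothetical helper names, and the "anything other than whitespace" rule is an assumed heuristic for deciding whether a PDF is already searchable. It relies on pdftotext writing to stdout when the output file is given as `-`.

```typescript
import { execFile } from 'child_process';
import { promisify } from 'util';

const execFileAsync = promisify(execFile);

// Hypothetical helper: run pdftotext and capture the extracted text.
// Passing "-" as the output file makes pdftotext write to stdout.
export async function pdfToText(pdfPath: string): Promise<string> {
  const { stdout } = await execFileAsync('pdftotext', [pdfPath, '-']);
  return stdout;
}

// Hypothetical heuristic (an assumption, not part of node-ts-ocr):
// treat the PDF as searchable if anything other than whitespace remains.
// trim() also removes the form feeds pdftotext emits between pages.
export function hasUsableText(text: string): boolean {
  return text.trim().length > 0;
}
```

If `hasUsableText` returns false, the document most likely needs the image conversion and OCR steps described next.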
A PDF is a jumble of instructions for how to render a document on a screen or page. Although it may contain images, a PDF is not itself an image, and therefore we can't perform OCR on it directly. To convert PDFs to images, we use ImageMagick's convert command, which depends on Ghostscript.
OSX
brew install imagemagick
brew install gs
Ubuntu
sudo apt-get update
sudo apt-get install imagemagick --fix-missing
sudo apt-get install ghostscript
CLI Example
Convert a PDF to a TIFF representation:
convert -density 300 /path/to/document.pdf -depth 8 -strip -background white -alpha off image.tiff
Tesseract is an open source OCR engine.
OSX
brew install tesseract
Ubuntu
sudo apt-get install tesseract-ocr
CLI Example
Once we have a TIFF representation of the document, we can use Tesseract to (attempt to) extract the plain text. Tesseract appends .txt to the output base name, so the command below writes output.txt:
tesseract image.tiff output
import { Ocr } from 'node-ts-ocr';
import * as path from 'path';

export async function getPdfText(fileName: string): Promise<string> {
  // Assuming your file resides in a directory named sample
  const relativePath = path.join('sample', fileName);
  const filePath = path.join(__dirname, relativePath);

  // Extract the text and return the result
  return await Ocr.extractText(filePath);
}
extractInfo(filePath: string)
Retrieves the PDF info using the pdfinfo binary and parses the result into a key-value object.
extractText(filePath: string, options?: ExtractTextOptions)
Extracts the text from the PDF using the pdftotext binary.
invokePdfToTiff(outDir: string, filePath: string, options?: ExtractTextOptions)
Converts a PDF file to its TIFF representation using the convert binary.
invokeImageOcr(outDir: string, imagePath: string, options?: ExtractTextOptions)
Performs OCR on an image in order to extract the text using the tesseract binary.
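To illustrate what "parse the result to a key value object" could mean for extractInfo, here is a rough sketch. It is an assumption about the general approach, not the library's actual code: `parsePdfInfo` is a hypothetical name, and the sample text is made-up illustrative input in the `Key:   value` layout that pdfinfo prints.

```typescript
// Hypothetical helper, not the node-ts-ocr implementation.
// Turns "Key:   value" lines into a plain key-value object.
export function parsePdfInfo(output: string): Record<string, string> {
  const info: Record<string, string> = {};
  for (const line of output.split(/\r?\n/)) {
    // Split on the first colon only, since values such as dates contain colons.
    const idx = line.indexOf(':');
    if (idx <= 0) continue; // skip blank or malformed lines
    const key = line.slice(0, idx).trim();
    const value = line.slice(idx + 1).trim();
    info[key] = value;
  }
  return info;
}

// Illustrative input only; real pdfinfo output has more fields.
const sample = [
  'Title:          Quarterly Report',
  'Pages:          4',
  'CreationDate:   Mon Jan  1 10:30:00 2018',
].join('\n');

console.log(parsePdfInfo(sample));
```

All values stay strings in this sketch; a caller that needs the page count as a number would convert `Pages` itself.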
ExtractTextOptions
The arguments are key-value pairs of valid command line arguments for the respective binary.
ExtractTextOptions {
  pdfToTextArgs?: KeyValue;
  convertArgs?: KeyValue;
  tesseractArgs?: KeyValue;
}
Example pdfToTextArgs that only includes pages 1 to 4.
Note: this will only work if you already have a searchable PDF, because the pdftotext binary can only be used to extract text from a searchable PDF.
{ pdfToTextArgs: { f: 1, l: 4 } }
Example convertArgs that sets the convert density to 600 and turns the trim option on.
{ convertArgs: { density: '600', trim: '' } }
Example tesseractArgs that sets the language to English, the page segmentation mode to 6, and preserves interword spaces.
{ tesseractArgs: { 'l': 'eng', '-psm': 6, 'c': 'preserve_interword_spaces=1' } }
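The examples above suggest that each key becomes a flag prefixed with a single dash and that an empty value means "flag only". The sketch below shows one way such a mapping could work. This is an inference from the examples, not the library's documented behavior: `toCliArgs` is a hypothetical name, and the `KeyValue` type here is an assumed stand-in for the library's own definition.

```typescript
// Assumed stand-in for the library's KeyValue type.
type KeyValue = Record<string, string | number>;

// Hypothetical helper: flatten key-value options into an argv-style array.
// Each key is prefixed with "-"; an empty string value yields a bare flag.
export function toCliArgs(args: KeyValue = {}): string[] {
  const result: string[] = [];
  for (const [key, value] of Object.entries(args)) {
    result.push(`-${key}`);
    const text = String(value);
    if (text !== '') result.push(text);
  }
  return result;
}

console.log(toCliArgs({ f: 1, l: 4 }));                // [ '-f', '1', '-l', '4' ]
console.log(toCliArgs({ density: '600', trim: '' }));  // [ '-density', '600', '-trim' ]
```

Under this reading, the `'-psm'` key in the tesseractArgs example would come out as `--psm`, which is why it is written with a leading dash.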
Coming Soon...