pdf-ocr-ts

JavaScript-only library that performs OCR on scanned PDFs and turns them into searchable PDFs.


pdf-ocr-ts creates searchable PDF files from PDF files that contain only images of scanned documents. It is JavaScript-only and therefore works without installing any further tools. Under the hood it uses pdf.js to render each page of a PDF to a PNG image, Jimp to create compressed JPEG images, tesseract.js to perform OCR, and pdf-lib to merge the single-page PDFs that tesseract.js creates into a final searchable output PDF.

Usage

To create a searchable PDF named outputFilename from inputFilename, use:

const { default: PdfOcr } = require('pdf-ocr-ts');

const inputFilename = './input/scan_test.pdf';
const outputFilename = './output/scan_test-searchable.pdf';

PdfOcr.createSearchablePdf(inputFilename, outputFilename);

In some contexts it is more convenient to read the input file in one function and write the searchable PDF in another component. For these cases pdf-ocr-ts offers the function getSearchablePdfBufferBased(Uint8Array), which takes a Uint8Array (e.g. created from the result of fs.readFile()), performs OCR and returns the searchable PDF as a Uint8Array (pdfBuffer in the example below, which also destructures a text field). The buffer can then be passed to fs.writeFile().

const { default: PdfOcr } = require('pdf-ocr-ts');
const fs = require('fs');
const path = require('path');

const inputFilename = './input/scan_test.pdf';
const outputFilename = './output/scan_test-searchable.pdf';

(async () => {
    const pdf = new Uint8Array(fs.readFileSync(path.resolve(__dirname, inputFilename)));
    const { pdfBuffer, text } = await PdfOcr.getSearchablePdfBufferBased(pdf);
    fs.writeFile(path.resolve(__dirname, outputFilename), pdfBuffer, (error) => {
      if (error) {
          console.error(`Error: ${error}`);
      } else {
          console.log(`Finished merging PDFs into ${outputFilename}.`);
      }
    });
})();
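
The read, OCR, write flow above can also be written with promise-based file I/O. The following is a minimal sketch and not part of pdf-ocr-ts: convertFile is a hypothetical helper, and the OCR function is passed in as a parameter, so the sketch assumes nothing about the library beyond what the example above shows (a function that takes a Uint8Array and resolves to an object with a pdfBuffer and a text field).

```javascript
const fs = require('fs/promises');

// Hypothetical helper (not part of pdf-ocr-ts): reads a PDF, runs the given
// OCR function on its bytes and writes the resulting searchable PDF.
// `ocr` is assumed to behave like PdfOcr.getSearchablePdfBufferBased in the
// example above: (Uint8Array) => Promise<{ pdfBuffer, text }>.
async function convertFile(ocr, inputPath, outputPath) {
  const input = new Uint8Array(await fs.readFile(inputPath));
  const { pdfBuffer, text } = await ocr(input);
  await fs.writeFile(outputPath, pdfBuffer);
  // Return the text field so the caller can use it as well.
  return text;
}
```

With the library loaded as in the examples above, it could be called as convertFile((pdf) => PdfOcr.getSearchablePdfBufferBased(pdf), inputFilename, outputFilename). Because the helper returns a promise, errors from reading, OCR or writing can be handled in a single catch.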

To generate log output, pdf-ocr-ts accepts a logger function. It ships with a minimal logger, simpleLog, and supports any logger with the call signature (level: string, message: string) => void (see ./utils/Logger.ts).

const { default: PdfOcr } = require('pdf-ocr-ts');
const { simpleLog } = require("pdf-ocr-ts/build/utils/Logger");

const inputFilename = './input/scan_test.pdf';
const outputFilename = './output/scan_test-searchable.pdf';

PdfOcr.createSearchablePdf(inputFilename, outputFilename, simpleLog);
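
Since any function with the signature (level, message) => void is accepted, a custom logger can be only a few lines long. The sketch below is hypothetical (createMemoryLogger is not part of pdf-ocr-ts): it collects log entries in an array instead of printing them, which can be handy in tests.

```javascript
// Hypothetical helper: builds a logger matching (level, message) => void
// that stores entries in memory instead of printing them.
function createMemoryLogger() {
  const entries = [];
  const log = (level, message) => {
    entries.push({ level, message, time: new Date() });
  };
  return { log, entries };
}

const { log, entries } = createMemoryLogger();
log('info', 'started');
log('error', 'something failed');

// Print the collected entries, one per line.
console.log(entries.map((e) => `${e.level}: ${e.message}`).join('\n'));
```

To use it with the library, log would be passed as the third argument in the same way as simpleLog above.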

The following example uses the winston logging library through a simple wrapper, logHelper(level: string, message: string). Internally, pdf-ocr-ts uses the log levels info, error and debug.

const { default: PdfOcr } = require('pdf-ocr-ts');
const { createLogger, transports, format } = require("winston");

// create the winston logger
const logger = createLogger({
  transports: [new transports.Console()],
  format: format.combine(
    format.colorize(),
    format.timestamp(),
    format.printf(({ timestamp, level, message }) => {
      return `[${timestamp}] ${level}: ${message}`;
    })
  ),
});

// wrap the winston logger in logHelper to comply with the call signature
// (level: string, message: string) => void
function logHelper(level, message) {
  logger.log(level, message);
}

const inputFilename = './input/scan_test.pdf';
const outputFilename = './output/scan_test-searchable.pdf';

// pass the logHelper function
PdfOcr.createSearchablePdf(inputFilename, outputFilename, logHelper);
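
Because pdf-ocr-ts only uses the levels info, error and debug, a wrapper can drop unwanted levels before they reach the underlying logger. This is a minimal sketch: withMinLevel is a hypothetical helper, and the severity ordering (debug < info < error) is this example's own assumption, not something the library defines.

```javascript
// Assumed ordering for this sketch only; pdf-ocr-ts does not define one.
const SEVERITY = { debug: 0, info: 1, error: 2 };

// Hypothetical helper: wraps a (level, message) => void logger and forwards
// only messages at or above minLevel.
function withMinLevel(minLevel, target) {
  if (!(minLevel in SEVERITY)) {
    throw new Error(`Unknown level: ${minLevel}`);
  }
  return (level, message) => {
    const severity = SEVERITY[level];
    // Unknown levels are passed through rather than silently dropped.
    if (severity === undefined || severity >= SEVERITY[minLevel]) {
      target(level, message);
    }
  };
}

const seen = [];
const quiet = withMinLevel('info', (level, message) => {
  seen.push(`${level}: ${message}`);
});
quiet('debug', 'rendering page 1'); // filtered out
quiet('info', 'OCR finished');
quiet('error', 'could not write file');
```

The returned function has the same call signature as the logger it wraps, so it could be passed wherever a logger such as logHelper is accepted.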

To build the module from source, run npm run build.

Last updated on 17 Apr 2024
