pdf-ocr-cli

A CLI tool for OCR processing of PDF files using Mistral API with optional LLM verification

Version: 1.0.1

Version published: 11 months ago

Maintainers: 1

Created: 11 months ago

PDF-OCR CLI Tool

Overview

A powerful TypeScript CLI tool that transforms scanned PDFs into searchable documents by:

Taking a PDF file input
Processing each page with Mistral API's OCR capabilities
Optionally verifying and improving text quality with Together.ai's free LLM
Reassembling everything into a searchable PDF

Perfect for digitizing paper documents, making image-based PDFs searchable, and extracting text from scanned materials.

Quick Start

Prerequisites

Node.js 14 or higher
Mistral API key
Together.ai API key (only needed for the verification feature)

Installation

# Install globally
npm install -g pdf-ocr-cli

# Or use without installing
npx pdf-ocr-cli --input input.pdf --output output.pdf

Set Up API Keys

Create a .env file in your working directory:

echo "MISTRAL_API_KEY=your_mistral_api_key_here" > .env
echo "TOGETHER_API_KEY=your_together_api_key_here" >> .env

Or set environment variables in your shell:

export MISTRAL_API_KEY=your_mistral_api_key_here
export TOGETHER_API_KEY=your_together_api_key_here
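
A quick check that the keys are present before starting a long run can save a failed job. The following is a minimal sketch, not the tool's own code: `missingApiKeys` is a hypothetical helper, and it assumes (based on the prerequisites above) that `TOGETHER_API_KEY` is only needed when verification is enabled.

```typescript
// Hypothetical helper for illustration; not part of pdf-ocr-cli's published API.
type Env = Record<string, string | undefined>;

/** Returns the names of required API keys that are missing or blank. */
function missingApiKeys(env: Env, verify: boolean): string[] {
  // MISTRAL_API_KEY is always required; TOGETHER_API_KEY only with --verify
  // (an assumption drawn from the prerequisites list).
  const required = verify
    ? ["MISTRAL_API_KEY", "TOGETHER_API_KEY"]
    : ["MISTRAL_API_KEY"];
  return required.filter((name) => (env[name] ?? "").trim() === "");
}
```

In a Node.js script this could be called as `missingApiKeys(process.env, true)` and the run aborted if the returned list is not empty.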

Basic Usage

# Process a PDF file
pdf-ocr --input input.pdf --output output.pdf

# With verification to improve OCR quality
pdf-ocr --input input.pdf --output output.pdf --verify

Common Use Cases

Process Large Documents Efficiently

# Process 3 pages at a time
pdf-ocr --input input.pdf --output output.pdf --concurrency 3

Handle Network Issues

# Increase retries and timeout for unstable connections
pdf-ocr --input input.pdf --output output.pdf --retries 5 --timeout 60000
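
To make the interaction of these flags concrete, here is a rough sketch of a retry loop with a per-attempt timeout. It is an illustration only: `withRetry` is a hypothetical function, the tool's real retry logic may differ, and it assumes `--retries` counts extra attempts after the first one (the README does not say whether the first attempt is included).

```typescript
// Hypothetical sketch; pdf-ocr-cli's actual retry implementation may differ.
const delay = (ms: number): Promise<void> =>
  new Promise((resolve) => setTimeout(resolve, ms));

/**
 * Runs `task`, retrying up to `retries` more times after a failure and
 * waiting `retryDelayMs` between attempts. An attempt that takes longer
 * than `timeoutMs` is treated as failed (the underlying work is not cancelled).
 */
async function withRetry<T>(
  task: () => Promise<T>,
  retries = 3,          // mirrors the --retries default
  retryDelayMs = 1000,  // mirrors the --retry-delay default
  timeoutMs = 30000,    // mirrors the --timeout default
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt <= retries; attempt++) {
    let timer: ReturnType<typeof setTimeout> | undefined;
    try {
      const timeout = new Promise<never>((_, reject) => {
        timer = setTimeout(
          () => reject(new Error(`timed out after ${timeoutMs} ms`)),
          timeoutMs,
        );
      });
      // Whichever settles first wins: the task or the timeout.
      return await Promise.race([task(), timeout]);
    } catch (err) {
      lastError = err;
      if (attempt < retries) await delay(retryDelayMs);
    } finally {
      clearTimeout(timer);
    }
  }
  throw lastError;
}
```

Under these assumptions, `--retries 5 --timeout 60000` would allow up to six attempts of at most 60 seconds each, with the retry delay between them.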

Process Carefully with Detailed Logs

# Process one page at a time with longer pauses and verbose logging
pdf-ocr --input input.pdf --output output.pdf --concurrency 1 --sleep 10000 --verbose

Command Options

Basic Options

Option	Alias	Description	Default
`--input`	`-i`	Input PDF file path	Required
`--output`	`-o`	Output PDF file path	Required
`--concurrency`	`-c`	Pages to process in parallel	2
`--max-pages`	`-m`	Maximum pages to process	All
`--help`	`-h`	Display help information
`--version`	`-v`	Display version information

OCR Options

Option	Alias	Description	Default
`--retries`	`-r`	Maximum OCR retry attempts	3
`--retry-delay`	`-d`	Delay between retries (ms)	1000
`--timeout`	`-t`	OCR API request timeout (ms)	30000
`--sleep`	`-s`	Time between processing pages (ms)	5000
`--verbose`	`-v`	Enable detailed logging

Verification Options

Option	Description	Default
`--verify`	Enable LLM verification
`--max-tokens`	Maximum tokens for verification	1000
`--temperature`	Temperature for verification	0.7
`--top-p`	Top-p for verification	0.9
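
The defaults in the three tables above can be summarized as a single typed options object. This is a sketch for orientation only: the interface and property names are hypothetical and not taken from the tool's source, and the `false` defaults for `verbose` and `verify` are an assumption, since the tables leave those cells blank.

```typescript
// Illustrative names only; the numeric defaults are copied from the option tables.
interface OcrCliOptions {
  input: string;
  output: string;
  concurrency: number;
  maxPages?: number;   // undefined means "all pages"
  retries: number;
  retryDelay: number;  // ms
  timeout: number;     // ms
  sleep: number;       // ms
  verbose: boolean;
  verify: boolean;
  maxTokens: number;
  temperature: number;
  topP: number;
}

const DEFAULTS = {
  concurrency: 2,
  retries: 3,
  retryDelay: 1000,
  timeout: 30000,
  sleep: 5000,
  verbose: false, // assumed: boolean flags are off unless passed
  verify: false,  // assumed: boolean flags are off unless passed
  maxTokens: 1000,
  temperature: 0.7,
  topP: 0.9,
};

/** Fills in documented defaults; later spreads win, so given values override them. */
function resolveOptions(
  given: Partial<OcrCliOptions> & Pick<OcrCliOptions, "input" | "output">,
): OcrCliOptions {
  return { ...DEFAULTS, ...given };
}
```

For example, `resolveOptions({ input: "input.pdf", output: "output.pdf", retries: 5 })` keeps every default except `retries`.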

Advanced Installation

Install from Source

# Clone and build
git clone https://github.com/luandro/pdf-ocr.git
cd pdf-ocr
npm install
npm run build

# Set up environment
cp .env.example .env
# Edit .env with your API keys

Development

This project follows Test-Driven Development principles:

# Run tests with coverage
npm test

# Run tests in watch mode
npm run test:watch

# Build the project
npm run build

# Run in development mode
npm run dev -- --input input.pdf --output output.pdf

Test Coverage

The project maintains high test coverage (>80%) for quality assurance:

# Run tests with coverage
npm test

# View coverage report
# (macOS; on Linux use xdg-open instead of open)
open coverage/lcov-report/index.html

Continuous Integration

GitHub Actions automates testing and publishing:

Tests run on every push to main
Coverage reports are generated
The package is published to npm automatically when tests pass

Architecture

The application consists of these key modules:

PDF Splitter (src/splitPdf.ts): Divides PDFs into individual pages
OCR Module (src/ocr.ts): Extracts text using Mistral API
Content Verification (src/contentVerification.ts): Improves text with LLM
Text-to-PDF Converter (src/textToPdf.ts): Converts text back to PDF
PDF Merger (src/mergePdfs.ts): Combines processed pages
CLI (src/cli.ts): Provides the command interface

Processing Pipeline

Split input PDF into individual pages
Process each page through the following steps in order (up to --concurrency pages at a time):
- Extract text with Mistral API OCR
- Optionally verify/improve text with Together.ai
- Convert text back to PDF format
Merge all processed pages into final PDF
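
The pipeline above can be sketched roughly as follows. This is not the project's code: the `Stages` interface and its function signatures are hypothetical stand-ins for the modules listed under Architecture, and the sketch assumes that `--concurrency` bounds how many pages are in flight at once. The `--sleep` pause between pages is left out for brevity.

```typescript
// Hypothetical stage signatures; the real modules in src/ may look different.
interface Stages {
  split: (pdf: Uint8Array) => Promise<Uint8Array[]>;
  ocr: (page: Uint8Array) => Promise<string>;
  verify?: (text: string) => Promise<string>; // only present with --verify
  toPdf: (text: string) => Promise<Uint8Array>;
  merge: (pages: Uint8Array[]) => Promise<Uint8Array>;
}

/** Maps items through `worker` with at most `limit` calls in flight, preserving order. */
async function mapWithLimit<T, R>(
  items: T[],
  limit: number,
  worker: (item: T, index: number) => Promise<R>,
): Promise<R[]> {
  const results: R[] = new Array(items.length);
  let next = 0;
  // Each runner repeatedly claims the next unprocessed index until none remain.
  const runner = async (): Promise<void> => {
    while (next < items.length) {
      const i = next++;
      results[i] = await worker(items[i], i);
    }
  };
  const count = Math.max(1, Math.min(limit, items.length));
  await Promise.all(Array.from({ length: count }, runner));
  return results;
}

/** Split, process each page (OCR, optional verification, text to PDF), then merge. */
async function runPipeline(
  input: Uint8Array,
  stages: Stages,
  concurrency = 2,
): Promise<Uint8Array> {
  const pages = await stages.split(input);
  const processed = await mapWithLimit(pages, concurrency, async (page) => {
    const raw = await stages.ocr(page);
    const text = stages.verify ? await stages.verify(raw) : raw;
    return stages.toPdf(text);
  });
  return stages.merge(processed);
}
```

Because results are written by index, the merged output keeps the original page order even when pages finish out of order.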

Troubleshooting

API Key Errors: Ensure your .env file contains valid API keys
Network Issues: Try increasing --retries, --timeout, and --retry-delay
Poor OCR Quality: Enable --verify to improve text with LLM
Processing Large Files: Reduce --concurrency and increase --sleep
Memory Issues: Limit the number of pages processed in a run with --max-pages

Contributing

Please see CONTRIBUTING.md for guidelines on contributing to this project.

License

This project is licensed under the ISC License - see the LICENSE file for details.

Package last updated on 03 May 2025
