New Case Study:See how Anthropic automated 95% of dependency reviews with Socket.Learn More →

ocr-search

Package Overview

Dependencies

Advanced tools

Install Socket

Detect and block malicious and high-risk dependencies

Install

ocr-search

🔍 Find files that contain some text with OCR

1.0.0
latest
Source
npm

Version published: 3 years ago

Weekly downloads: 5; increased by25%

Maintainers: 1

Weekly downloads

Created: 3 years ago

Source

OCR Search

🔍 Find files that contain some text with OCR.

Supported file formats:

Images: JPEG, PNG, WebP
Documents: PDF

Unsupported file formats:

Images: AVIF, WebP 2 (.wp2), JPEG XL (.jxl)
Documents: Office (.docx, .xlsx, .pptx, ...)

Tesseract OCR is used internally (Tesseract Documentation). For PDF to PNG conversion, Poppler is used.

This package uses worker threads to make use of your CPU cores and be faster.

Notes:

The OCR will provide bad results for rotated files/non-straight text.
- 90/180 degrees rotations seems to output a good result
- You may want to pre-process your files somehow to make the text straight!
Files will be matched if at least 1 of the words is found in the text contained in it.

Install

No matter how you decide to use this package, you need to install Tesseract OCR anyway. If you have some PDF files, they need to be converted with additional packages.

# OCR Package (non-linux, see https://github.com/tesseract-ocr/tesseract#installing-tesseract)
sudo apt install tesseract-ocr

# PDF to JPEG conversion command-line (for Windows, see https://stackoverflow.com/a/53960829 - MacOS `brew install poppler`)
# You can skip this if you don't plan to scan PDF files
sudo apt install poppler-utils

OCR Language

If you want to use another language than English, download then install the required language from the Tesseract OCR Languages Models repository.

# French language
wget https://github.com/tesseract-ocr/tessdata_fast/raw/main/fra.traineddata
sudo cp fra.traineddata /usr/share/tesseract-ocr/4.00/tessdata/

Use with CLI

This will install the ocr-search CLI.

pnpm i -g ocr-search

$ ocr-search --help

  🔍 Find files that contain some text with OCR

  Usage
    $ ocr-search --words "<words_list>" <input_files>

  To delete images created from PDF files pages extractions, check the other provided command:
    $ ocr-search --help

  Required
    --words List of comma-separated words to search (if "MATCH_ALL", will match everything for mass OCR extraction)

  Options
    --ignoreExt         List of comma-separated file extensions to ignore (e.g. ".pdf,.jpg")
    --pdfExtractFirst   Range start of the pages to extract from PDF files (1-indexed)
    --pdfExtractLast    Range end of the pages to extract from PDF files, last page if overflow (1-indexed)
    --progressFile      File to save progress to, will start from where it
                        stopped last time by looking there (no file, use "none")  [default="progress.json"]
    --matchesLogFile    Log all matches to this file (no file, use "none") [default="matches.txt"]
    --no-console-logs   Silence all console logs
    --no-show-matches   Do not print matched files text content to the console [default="false"]
    --workers           Amount of worker threads to use (default is total CPU cores count - 2)

  OCR Options - See https://github.com/tesseract-ocr/tesseract/blob/main/doc/tesseract.1.asc
    --lang  Tesseract OCR LANG configuration [default="eng"]
    --oem   Tesseract OCR OEM configuration [default="1"]
    --psm   Tesseract OCR PSM configuration [default="1"]

  Examples
    Scan the "scanned-dir" directory and match all the files containing "system", "wiki" and "hello"
      $ ocr-search --words "system,wiki,hello" scanned-dir

    Scan the glob-matched files "*" and match all files (mass OCR text extraction)
      $ ocr-search --words MATCH_ALL *

    Skip .pdf and .webp files
      $ ocr-search --words "wiki,hello" --ignoreExt ".pdf,.webp" scanned-dir

    Extract only page 3 to 6 in all PDF files (1-indexed)
      $ ocr-search --words "wiki,hello" --pdfExtractFirst 3 --pdfExtractLast 6 scanned-dir

    Use a specific Tesseract OCR configuration
      $ ocr-search --words "wiki,hello" --lang fra --oem 1 --psm 3 scanned-dir

  https://github.com/rigwild/ocr-search

Another CLI is provided to easily remove all extracted PDF pages images.

$ ocr-search-clean --help

  🗑️ Find and remove content generated by ocr-search

  Usage
    $ ocr-search-clean [--pdf] [--txt] <input_files>

  Options
    --pdf  Remove images that were generated by PDF files pages extraction (e.g."file.pdf-1.png")
    --txt  Remove text files that were generated by OCR (option "--save-ocr" in "ocr-search")

  https://github.com/rigwild/ocr-search

Use with provided runner

git clone https://github.com/rigwild/ocr-search.git
cd ocr-search
pnpm install # or npm install -D
pnpm build

Put all your files/directories in the data directory. They can be in subfolders.

The progress will be printed to the console and saved in the progress.json file.

The list of files that match at least one of the provided words and their content will be saved to the matches.txt file.

node run.js

See run.js.

Use Programatically

Install

pnpm i ocr-search

Directory scan

import path from 'path'
import { scanDir, TesseractConfig } from 'ocr-search'

// The list of options
export type ScanOptions = {
  /**
   * List of words to search (if one is matched, the file is matched)
   *
   * If not provided, every files will get matched (useful to do mass OCR and save the result)
   */
  words?: string[]

  /** Should the OCR scanned content of each file be saved to a txt file (e.g. "file.png.txt") */
  saveOcr?: boolean

  /** Should the logs be printed to the console? (default = false) */
  shouldConsoleLog?: boolean

  /** Should the matches file content be printed to the console? (default = true) */
  shouldConsoleLogMatches?: boolean

  /**
   * If provided, the progress will be saved to a file
   *
   * When stopped, the process will start from where it stopped last time by looking there
   */
  progressFile?: string

  /** If provided, every file path and their text content that were matched are logged to this file */
  matchesLogFile?: string

  /** File extensions to ignore when looking for files (e.g. `new Set(['.pdf', '.jpg'])`) */
  ignoreExt?: Set<string>

  /* Extract PDF files starting at this page, first page is 1 (1-indexed) (default = 1) */
  pdfExtractFirst?: number

  /* Extract PDF files until this page, last page if overflow (1-indexed) (default = last page of PDF file) */
  pdfExtractLast?: number

  /**
   * Amount of worker threads to use (default = your total CPU cores - 2)
   *
   * Note: Using all your available cores may slow down the process!
   */
  workerPoolSize?: number

  /**
   * Tesseract OCR config, will default `{ lang: 'eng', oem: 1, psm: 1 }`
   *
   * @see https://github.com/tesseract-ocr/tesseract/blob/main/doc/tesseract.1.asc
   */
  tesseractConfig?: TesseractConfig
}

const scannedDir = path.resolve(__dirname, 'data')
const words = ['hello', 'match this', '<<<<<']
const tesseractConfig: TesseractConfig = { lang: 'fra', oem: 1, psm: 1 }

console.time('scan')

await scanDir(scannedDir, {
  words,
  shouldConsoleLog: true,
  tesseractConfig
})

console.log('Scan finished!')
console.timeEnd('scan')

Perform OCR on a single file

import path from 'path'
import { ocr } from 'ocr-search'

const file = path.resolve(__dirname, '..', 'test', '_testFiles', 'sample.jpg')

// Tesseract configuration
const tesseractConfig: TesseractConfig = { lang: 'eng', oem: 1, psm: 1 }

// Should the string be normalized? (lowercase, accents removed, whitespace removed)
const shouldCleanStr: boolean | undefined = true

const text = await ocr(file, tesseractConfig, shouldCleanStr)
console.log(text)

PDF to images conversion

Convert PDF pages to PNG. Files are generated on the file system, 1 file per PDF page.

import path from 'path'
import { pdfToImages } from 'ocr-search'

const filePdf = path.resolve(__dirname, '..', 'test', '_testFiles', 'sample.pdf')

// Extract from page 1 to page 3 (1-indexed)
const res = await pdfToImages(filePdf, 1, 3)
console.log(res) // Paths to generated PNG files

License

GNU Affero General Public License v3.0

Keywords

FAQs

What is ocr-search?

Is ocr-search popular?

Is ocr-search well maintained?

Package last updated on 12 Dec 2021

Did you know?

Socket for GitHub automatically highlights issues in each pull request and monitors the health of all your open source dependencies. Discover the contents of your packages and block harmful activity before you install or update your dependencies.

Install