bulk-files-ocr-search

Version: 0.1.4

Version published: 4 years ago

Maintainers: 1

Created: 4 years ago

Bulk Files OCR Text Finder

🔍 Find files that contain some text with OCR.

Supported file formats:

Images: JPEG, PNG, WebP
Documents: PDF

Unsupported file formats:

Images: AVIF, WebP 2 (.wp2), JPEG XL (.jxl)
Documents: Office (.docx, .xlsx, .pptx, ...)

Tesseract OCR is used internally (Tesseract Documentation). For PDF to PNG conversion, Poppler is used.

This package uses worker threads to spread the OCR work across your CPU cores, which speeds up large scans.
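As a rough illustration of how work could be spread across a pool, the sketch below computes the documented default pool size (total CPU cores minus 2) and deals files into one bucket per worker. Both helper names are hypothetical, and the floor of 1 worker and the round-robin split are assumptions rather than the package's confirmed behavior.

```typescript
import * as os from 'os'

// Hypothetical helper: documented default is "total CPU cores - 2".
// Assumption: never go below 1 worker on small machines.
function defaultWorkerCount(totalCores: number = os.cpus().length): number {
  return Math.max(1, totalCores - 2)
}

// Hypothetical helper: deal files round-robin into one bucket per worker.
function partitionFiles(files: string[], workerCount: number): string[][] {
  const buckets: string[][] = Array.from({ length: workerCount }, () => [])
  files.forEach((file, i) => buckets[i % workerCount].push(file))
  return buckets
}
```

Each bucket could then be handed to one `worker_threads` worker. Leaving a couple of cores free keeps the machine responsive while the scan runs.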

Notes:

OCR produces poor results for rotated files or text that is not straight.
- Rotations of 90 or 180 degrees seem to produce good results.
- You may want to pre-process your files to straighten the text.
A file is matched if at least one of the words is found in the text it contains.
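The matching rule in the last note can be sketched as follows. This is a minimal illustration, not the package's code: `normalize` and `matchesAnyWord` are hypothetical names, and the normalization (lowercase, accents removed, whitespace removed) is assumed to mirror what the single-file OCR section describes for cleaned strings.

```typescript
// Assumed normalization: lowercase, strip accents, drop all whitespace.
function normalize(text: string): string {
  return text
    .toLowerCase()
    .normalize('NFD') // split accented letters into base letter + combining mark
    .replace(/[\u0300-\u036f]/g, '') // remove the combining marks
    .replace(/\s+/g, '')
}

// Hypothetical helper: a file matches as soon as ONE word is found in its text.
function matchesAnyWord(ocrText: string, words: string[]): boolean {
  const haystack = normalize(ocrText)
  return words.some(word => haystack.includes(normalize(word)))
}
```

Because whitespace is removed on both sides in this sketch, a multi-word entry such as `'match this'` is still found when the OCR output contains irregular spacing.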

Install

However you use this package, Tesseract OCR must be installed. To scan PDF files, you also need Poppler to convert them to images.

# OCR Package (non-linux, see https://github.com/tesseract-ocr/tesseract#installing-tesseract)
sudo apt install tesseract-ocr

# PDF to PNG conversion command-line tools (for Windows, see https://stackoverflow.com/a/53960829; on macOS, run `brew install poppler`)
# You can skip this if you don't plan to scan PDF files
sudo apt install poppler-utils

OCR Language

To use a language other than English, download and install the required language model from the Tesseract OCR language models repository.

# French language
wget https://github.com/tesseract-ocr/tessdata_fast/raw/main/fra.traineddata
sudo cp fra.traineddata /usr/share/tesseract-ocr/4.00/tessdata/

Use with CLI

This will install the ocr-search CLI.

pnpm i -g bulk-files-ocr-search

  🔍 Find files that contain some text with OCR

  Usage
    $ ocr-search --words "<words_list>" <input_files>

  Required
    --words List of comma-separated words to search (if "MATCH_ALL", will match everything for mass OCR extraction)

  Options
    --progressFile     File to save progress to, will start from where it stopped last time by looking there (none="none")  [default="progress.json"]
    --matchesLogFile   Log all matches to this file (none="none") [default="matches.txt"]
    --no-console-logs  Silence console logs
    --workers          Amount of worker threads to use (default is total CPU cores count - 2)

  OCR Options - See https://github.com/tesseract-ocr/tesseract/blob/main/doc/tesseract.1.asc
    --lang  Tesseract OCR LANG configuration [default="eng"]
    --oem   Tesseract OCR OEM configuration [default="1"]
    --psm   Tesseract OCR PSM configuration [default="1"]

  Examples
    Scan the "scanned-dir" directory and match all the files containing "system", "wiki" and "hello"
      $ ocr-search --words "system,wiki,hello" scanned-dir

    Scan the glob-matched files "*" and match all files (mass OCR text extraction)
      $ ocr-search --words MATCH_ALL *

    Use a specific Tesseract OCR configuration
      $ ocr-search --words "wiki,hello" --lang fra --oem 1 --psm 3 scanned-dir

    Do not save progress and do not log matches to file
      $ ocr-search --words "wiki,hello" --progressFile none --matchesLogFile none scanned-dir

  https://github.com/rigwild/bulk-files-ocr-search
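To make the `--words` format concrete, here is a minimal sketch of how a comma-separated value could be turned into a word list, with `MATCH_ALL` handled as the special case described in the help text. `parseWordsOption` and `WordsOption` are hypothetical names; trimming and dropping empty entries are assumptions, not confirmed CLI behavior.

```typescript
type WordsOption = { matchAll: true } | { matchAll: false; words: string[] }

// Hypothetical parser for the value passed to --words.
function parseWordsOption(raw: string): WordsOption {
  // "MATCH_ALL" means every file matches (mass OCR text extraction).
  if (raw.trim() === 'MATCH_ALL') return { matchAll: true }
  const words = raw
    .split(',')
    .map(word => word.trim())
    .filter(word => word.length > 0) // assumption: ignore empty entries like "a,,b"
  return { matchAll: false, words }
}
```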

Use with provided runner

git clone https://github.com/rigwild/bulk-files-ocr-search
cd bulk-files-ocr-search
pnpm install # or npm install -D
pnpm build

Put all your files/directories in the data directory. They can be in subfolders.

The progress will be printed to the console and saved in the progress.json file.

The list of files that match at least one of the provided words and their content will be saved to the matches.txt file.

node run.js

See run.js.
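The resume behavior can be pictured with the sketch below. The real layout of progress.json is not documented here, so this assumes a plain JSON array of already-processed file paths; all three helper names are hypothetical.

```typescript
import * as fs from 'fs'

// Assumed format: progress file holds a JSON array of processed file paths.
function loadProcessed(progressFile: string): Set<string> {
  if (!fs.existsSync(progressFile)) return new Set<string>()
  const parsed: unknown = JSON.parse(fs.readFileSync(progressFile, 'utf8'))
  const paths = Array.isArray(parsed)
    ? parsed.filter((p): p is string => typeof p === 'string')
    : []
  return new Set<string>(paths)
}

// Persist the set so an interrupted run can pick up where it stopped.
function saveProcessed(progressFile: string, processed: Set<string>): void {
  fs.writeFileSync(progressFile, JSON.stringify(Array.from(processed), null, 2))
}

// Only files that are not yet recorded still need OCR.
function remainingFiles(allFiles: string[], processed: Set<string>): string[] {
  return allFiles.filter(file => !processed.has(file))
}
```

Saving after each processed file, rather than once at the end, is what would let a stopped run resume without redoing finished work.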

Use Programmatically

Install

pnpm i bulk-files-ocr-search

Directory scan

import path from 'path'
import { scanDir, TesseractConfig } from 'bulk-files-ocr-search'

// The list of options
export type ScanOptions = {
  /**
   * List of words to search (if one is matched, the file is matched)
   *
   * If set to `['MATCH_ALL']`, every file will be matched (useful for mass OCR when you want to save the results)
   */
  words: string[] | ['MATCH_ALL']

  /**
   * Should the logs be printed to the console? (default = false)
   */
  shouldConsoleLog?: boolean

  /**
   * If provided, the progress will be saved to a file
   *
   * When stopped, the process will start from where it stopped last time by looking there
   */
  progressFile?: string

  /**
   * If provided, every file path and their text content that were matched are logged to this file
   */
  matchesLogFile?: string

  /**
   * Amount of worker threads to use (default = your total CPU cores - 2)
   *
   * Note: Using all your available cores may slow down the process!
   */
  workerPoolSize?: number

  /**
   * Tesseract OCR config, will default to `{ lang: 'eng', oem: 1, psm: 1 }`
   *
   * @see https://github.com/tesseract-ocr/tesseract/blob/main/doc/tesseract.1.asc
   */
  tesseractConfig?: TesseractConfig
}

const words = ['hello', 'match this', '<<<<<']

const scannedDir = path.resolve(__dirname, 'data')
const progressFile = path.resolve(__dirname, 'progress.json')
const matchesLogFile = path.resolve(__dirname, 'matches.txt')
const tesseractConfig: TesseractConfig = { lang: 'eng', oem: 1, psm: 1 }

console.time('scan')

await scanDir(scannedDir, {
  words,
  shouldConsoleLog: true,
  progressFile,
  matchesLogFile,
  tesseractConfig
})

console.log('Scan finished!')
console.timeEnd('scan')

Perform OCR on a single file

import path from 'path'
import { ocr, TesseractConfig } from 'bulk-files-ocr-search'

const file = path.resolve(__dirname, '..', 'test', '_testFiles', 'sample.jpg')

// Tesseract configuration
const tesseractConfig: TesseractConfig = { lang: 'eng', oem: 1, psm: 1 }

// Should the string be normalized? (lowercase, accents removed, whitespace removed)
const shouldCleanStr: boolean | undefined = true

const text = await ocr(file, tesseractConfig, shouldCleanStr)
console.log(text)
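The three configuration fields correspond to standard `tesseract` command-line flags (`-l`, `--oem`, `--psm`). The sketch below shows that mapping with the documented defaults. Whether the package actually invokes the CLI this way is an assumption, and `buildTesseractArgs` and `TesseractConfigSketch` are hypothetical names.

```typescript
// Illustrative stand-in for the package's TesseractConfig type.
type TesseractConfigSketch = { lang?: string; oem?: number; psm?: number }

// Hypothetical helper: build the argument list for the `tesseract` binary.
function buildTesseractArgs(file: string, config: TesseractConfigSketch = {}): string[] {
  // Documented defaults: { lang: 'eng', oem: 1, psm: 1 }
  const { lang = 'eng', oem = 1, psm = 1 } = config
  // "stdout" as the output base makes tesseract print the text instead of writing a file.
  return [file, 'stdout', '-l', lang, '--oem', String(oem), '--psm', String(psm)]
}
```

The resulting array could be passed to Node's `child_process.execFile('tesseract', args)` to capture the recognized text.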

PDF to images conversion

Convert PDF pages to PNG. Files are generated on the file system, 1 file per PDF page.

import path from 'path'
import { pdfToImages } from 'bulk-files-ocr-search'

const filePdf = path.resolve(__dirname, '..', 'test', '_testFiles', 'sample.pdf')

const res = await pdfToImages(filePdf)
console.log(res) // Paths to generated PNG files
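For context, Poppler's `pdftoppm` tool can render each PDF page to a PNG with its `-png` flag. The sketch below only builds the arguments for such a call; it is an assumption about how the conversion might be driven, not the package's confirmed implementation, and `buildPdftoppmArgs` is a hypothetical name.

```typescript
import * as path from 'path'

// Hypothetical helper: arguments for `pdftoppm -png <input.pdf> <output prefix>`.
// pdftoppm writes one image per page, appending the page number to the prefix.
function buildPdftoppmArgs(pdfPath: string, outDir: string): string[] {
  const prefix = path.join(outDir, path.basename(pdfPath, '.pdf'))
  return ['-png', pdfPath, prefix]
}
```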

License

The MIT License


Package last updated on 10 Dec 2021
