Bulk Files OCR Text Finder

🔍 Find files that contain some text with OCR.
Supported file formats:
- Images: JPEG, PNG, WebP
- Documents: PDF
Unsupported file formats:
Tesseract OCR is used internally (Tesseract Documentation). For PDF to PNG conversion, Poppler is used.
This package uses worker threads to make use of your CPU cores and be faster.
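As a rough illustration of how the pool size could be derived, the sketch below applies the "total CPU cores count - 2" rule stated for the `--workers` option. The helper name `defaultWorkerCount` is hypothetical, and the floor of 1 worker is an assumption that is not confirmed by the package.

```typescript
// Hypothetical helper, not part of the package API.
// The "cores - 2" rule comes from the --workers option description;
// clamping to at least 1 worker is an assumption for low-core machines.
function defaultWorkerCount(cpuCores: number): number {
  return Math.max(1, cpuCores - 2)
}

console.log(defaultWorkerCount(8)) // 6
```

In a Node.js program the core count could be taken from `os.cpus().length`.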
Notes:
- OCR gives poor results on rotated files or text that is not straight.
- 90/180 degree rotations seem to give good results.
- You may want to pre-process your files to straighten the text!
- A file is matched if at least one of the words is found in the text it contains.
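The matching rule from the notes above can be sketched as follows. This is a minimal illustration, not the package's implementation: the function name `isMatch` is hypothetical, and the case-insensitive substring comparison is an assumption. The `MATCH_ALL` special value is the one documented for the `--words` option.

```typescript
// Hypothetical sketch of the "at least one word" rule.
const MATCH_ALL = 'MATCH_ALL'

function isMatch(text: string, words: string[]): boolean {
  // A lone "MATCH_ALL" matches every file (mass OCR extraction).
  if (words.length === 1 && words[0] === MATCH_ALL) return true
  // Assumption: comparison is a case-insensitive substring search.
  const haystack = text.toLowerCase()
  return words.some(word => haystack.includes(word.toLowerCase()))
}

console.log(isMatch('Hello world', ['wiki', 'hello'])) // true
console.log(isMatch('nothing relevant', ['wiki', 'hello'])) // false
```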
Install
However you decide to use this package, you need to install Tesseract OCR. If you have PDF files, you also need Poppler to convert them to images.
sudo apt install tesseract-ocr
sudo apt install poppler-utils
OCR Language
If you want to use a language other than English, download the required language model from the Tesseract OCR Languages Models repository, then install it.
wget https://github.com/tesseract-ocr/tessdata_fast/raw/main/fra.traineddata
sudo cp fra.traineddata /usr/share/tesseract-ocr/4.00/tessdata/
Use with CLI
This will install the ocr-search CLI.
pnpm i -g bulk-files-ocr-search
🔍 Find files that contain some text with OCR
Usage
$ ocr-search --words "<words_list>" <input_files>
Required
--words List of comma-separated words to search (if "MATCH_ALL", will match everything for mass OCR extraction)
Options
--progressFile File to save progress to, will start from where it stopped last time by looking there (none="none") [default="progress.json"]
--matchesLogFile Log all matches to this file (none="none") [default="matches.txt"]
--no-console-logs Silence console logs
--workers Amount of worker threads to use (default is total CPU cores count - 2)
OCR Options - See https://github.com/tesseract-ocr/tesseract/blob/main/doc/tesseract.1.asc
--lang Tesseract OCR LANG configuration [default="eng"]
--oem Tesseract OCR OEM configuration [default="1"]
--psm Tesseract OCR PSM configuration [default="1"]
Examples
Scan the "scanned-dir" directory and match all the files containing "system", "wiki" or "hello"
$ ocr-search --words "system,wiki,hello" scanned-dir
Scan the glob-matched files "*" and match all files (mass OCR text extraction)
$ ocr-search --words MATCH_ALL *
Use a specific Tesseract OCR configuration
$ ocr-search --words "wiki,hello" --lang fra --oem 1 --psm 3 scanned-dir
Do not save progress and do not log matches to file
$ ocr-search --words "wiki,hello" --progressFile none --matchesLogFile none scanned-dir
https://github.com/rigwild/bulk-files-ocr-search
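The `--words` option takes a comma-separated list. A minimal sketch of turning that string into the `words` array is shown below. The function name `parseWordsOption` is hypothetical, and trimming whitespace and dropping empty entries are assumptions about how the CLI behaves.

```typescript
// Hypothetical helper: splits the --words value into an array of words.
// Assumption: surrounding whitespace is trimmed and empty entries are dropped.
function parseWordsOption(raw: string): string[] {
  return raw
    .split(',')
    .map(word => word.trim())
    .filter(word => word.length > 0)
}

console.log(parseWordsOption('system,wiki,hello')) // [ 'system', 'wiki', 'hello' ]
```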
Use with provided runner
git clone https://github.com/rigwild/bulk-files-ocr-search
cd bulk-files-ocr-search
pnpm install
pnpm build
Put all your files/directories in the data directory. They can be in subfolders.
The progress will be printed to the console and saved in the progress.json file.
The files that match at least one of the provided words, along with their content, will be saved to the matches.txt file.
node run.js
See run.js.
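Resuming from a saved progress file can be pictured as filtering out the files that were already processed. The sketch below is only an illustration: the `Progress` shape and the `remainingFiles` helper are hypothetical, since the actual format of progress.json is not described here.

```typescript
// Hypothetical progress shape; the real progress.json format may differ.
type Progress = { processed: string[] }

// Returns the files that still need to be scanned after a previous run.
function remainingFiles(allFiles: string[], progress: Progress): string[] {
  const done = new Set(progress.processed)
  return allFiles.filter(file => !done.has(file))
}

const todo = remainingFiles(['a.png', 'b.png', 'c.pdf'], { processed: ['a.png'] })
console.log(todo) // [ 'b.png', 'c.pdf' ]
```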
Use Programmatically
Install
pnpm i bulk-files-ocr-search
Directory scan
import path from 'path'
import { scanDir, TesseractConfig } from 'bulk-files-ocr-search'
export type ScanOptions = {
words: string[] | ['MATCH_ALL']
shouldConsoleLog?: boolean
progressFile?: string
matchesLogFile?: string
workerPoolSize?: number
tesseractConfig?: TesseractConfig
}
const words = ['hello', 'match this', '<<<<<']
const scannedDir = path.resolve(__dirname, 'data')
const progressFile = path.resolve(__dirname, 'progress.json')
const matchesLogFile = path.resolve(__dirname, 'matches.txt')
const tesseractConfig: TesseractConfig = { lang: 'eng', oem: 1, psm: 1 }
console.time('scan')
await scanDir(scannedDir, {
words,
shouldConsoleLog: true,
progressFile,
matchesLogFile,
tesseractConfig
})
console.log('Scan finished!')
console.timeEnd('scan')
Perform OCR on a single file
import path from 'path'
import { ocr, TesseractConfig } from 'bulk-files-ocr-search'
const file = path.resolve(__dirname, '..', 'test', '_testFiles', 'sample.jpg')
const tesseractConfig: TesseractConfig = { lang: 'eng', oem: 1, psm: 1 }
const shouldCleanStr: boolean | undefined = true
const text = await ocr(file, tesseractConfig, shouldCleanStr)
console.log(text)
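To show how the `lang`, `oem` and `psm` values relate to Tesseract itself, the sketch below builds the argument list for the `tesseract` command line (`-l`, `--oem` and `--psm` are Tesseract CLI options, and `stdout` as the output base prints the text to standard output). Whether the package invokes Tesseract exactly this way is an assumption; `buildTesseractArgs` and `TesseractConfigSketch` are hypothetical names, and the defaults mirror those listed for the CLI.

```typescript
// Hypothetical stand-in for the package's TesseractConfig type.
type TesseractConfigSketch = { lang?: string; oem?: number; psm?: number }

// Builds arguments for: tesseract <file> stdout -l <lang> --oem <n> --psm <n>
function buildTesseractArgs(file: string, config: TesseractConfigSketch = {}): string[] {
  const { lang = 'eng', oem = 1, psm = 1 } = config
  return [file, 'stdout', '-l', lang, '--oem', String(oem), '--psm', String(psm)]
}

console.log(buildTesseractArgs('sample.jpg', { lang: 'fra', psm: 3 }).join(' '))
// sample.jpg stdout -l fra --oem 1 --psm 3
```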
PDF to images conversion
Convert PDF pages to PNG. Files are generated on the file system, 1 file per PDF page.
import path from 'path'
import { pdfToImages } from 'bulk-files-ocr-search'
const filePdf = path.resolve(__dirname, '..', 'test', '_testFiles', 'sample.pdf')
const res = await pdfToImages(filePdf)
console.log(res)
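For context, Poppler ships a `pdftoppm` tool that writes one image per page when given the `-png` option. The sketch below builds such an argument list. It assumes the conversion relies on `pdftoppm`, which is not confirmed here, and both the helper name `buildPdftoppmArgs` and the choice of output prefix are hypothetical.

```typescript
// Hypothetical helper: arguments for `pdftoppm -png <pdf> <output-prefix>`.
// Assumption: the output prefix is the PDF path without its .pdf extension.
function buildPdftoppmArgs(pdfPath: string): string[] {
  const outputPrefix = pdfPath.replace(/\.pdf$/i, '')
  return ['-png', pdfPath, outputPrefix]
}

console.log(buildPdftoppmArgs('/tmp/sample.pdf'))
// [ '-png', '/tmp/sample.pdf', '/tmp/sample' ]
```

The resulting array could be passed to Node's `child_process.execFile('pdftoppm', args)`.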
License
The MIT License