OCR Search
![license](https://img.shields.io/npm/l/ocr-search?color=blue)
🔍 Find files that contain some text with OCR.
Supported file formats:
- Images: JPEG, PNG, WebP
- Documents: PDF
Unsupported file formats:
Tesseract OCR is used internally (see the Tesseract documentation). Poppler is used to convert PDF pages to PNG.
This package uses worker threads to spread the OCR work across your CPU cores.
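The `--workers` option documented in the CLI help defaults to the total CPU core count minus 2. As a minimal sketch of that rule, the hypothetical helper below (`defaultWorkerCount` is not part of the package) computes such a pool size. The lower bound of 1 is an assumption, added so that machines with one or two cores still get a worker.

```typescript
import { cpus } from 'node:os'

/**
 * Hypothetical helper (not part of ocr-search): pick a worker pool size
 * that leaves two cores free for the rest of the system.
 * Assumption: never go below one worker.
 */
function defaultWorkerCount(cpuCount: number = cpus().length): number {
  return Math.max(1, cpuCount - 2)
}

console.log(`Would use ${defaultWorkerCount()} worker thread(s) on this machine`)
```

Leaving a couple of cores idle keeps the machine responsive while a long scan is running.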
Notes:
- OCR gives poor results on rotated files or text that is not straight.
- Rotations of 90 or 180 degrees seem to give good results.
- You may want to pre-process your files to straighten the text.
- A file is matched if at least one of the words is found in the text it contains.
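The matching rule in the last note can be sketched as follows. `matchesAnyWord` is a hypothetical helper, not the package's actual implementation, and the case-insensitive substring comparison is an assumption.

```typescript
/**
 * Hypothetical helper (not part of ocr-search): returns true when at least
 * one of the words occurs in the OCR text.
 * Assumption: matching is a case-insensitive substring search.
 */
function matchesAnyWord(text: string, words: string[]): boolean {
  const haystack = text.toLowerCase()
  return words.some(word => haystack.includes(word.toLowerCase()))
}

console.log(matchesAnyWord('Hello from the wiki', ['system', 'wiki'])) // true
console.log(matchesAnyWord('Nothing relevant here', ['system', 'wiki'])) // false
```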
Install
However you use this package, you need to install Tesseract OCR. To scan PDF files, also install Poppler (poppler-utils), which is used to convert PDF pages to PNG.
sudo apt install tesseract-ocr
sudo apt install poppler-utils
OCR Language
To use a language other than English, download the required language model from the Tesseract OCR Languages Models repository and install it.
wget https://github.com/tesseract-ocr/tessdata_fast/raw/main/fra.traineddata
sudo cp fra.traineddata /usr/share/tesseract-ocr/4.00/tessdata/
Use with CLI
This will install the ocr-search CLI.
pnpm i -g ocr-search
$ ocr-search --help
🔍 Find files that contain some text with OCR
Usage
$ ocr-search --words "<words_list>" <input_files>
To delete the images created by PDF page extraction, check the other provided command:
$ ocr-search-clean --help
Required
--words List of comma-separated words to search (if "MATCH_ALL", will match everything for mass OCR extraction)
Options
--ignoreExt List of comma-separated file extensions to ignore (e.g. ".pdf,.jpg")
--pdfExtractFirst Range start of the pages to extract from PDF files (1-indexed)
--pdfExtractLast Range end of the pages to extract from PDF files, last page if overflow (1-indexed)
--progressFile File to save progress to, will start from where it stopped last time by looking there (no file, use "none") [default="progress.json"]
--matchesLogFile Log all matches to this file (no file, use "none") [default="matches.txt"]
--no-console-logs Silence all console logs
--no-show-matches Do not print matched files text content to the console [default="false"]
--workers Amount of worker threads to use (default is total CPU cores count - 2)
OCR Options - See https://github.com/tesseract-ocr/tesseract/blob/main/doc/tesseract.1.asc
--lang Tesseract OCR LANG configuration [default="eng"]
--oem Tesseract OCR OEM configuration [default="1"]
--psm Tesseract OCR PSM configuration [default="1"]
Examples
Scan the "scanned-dir" directory and match all the files containing "system", "wiki" or "hello"
$ ocr-search --words "system,wiki,hello" scanned-dir
Scan the glob-matched files "*" and match all files (mass OCR text extraction)
$ ocr-search --words MATCH_ALL *
Skip .pdf and .webp files
$ ocr-search --words "wiki,hello" --ignoreExt ".pdf,.webp" scanned-dir
Extract only pages 3 to 6 in all PDF files (1-indexed)
$ ocr-search --words "wiki,hello" --pdfExtractFirst 3 --pdfExtractLast 6 scanned-dir
Use a specific Tesseract OCR configuration
$ ocr-search --words "wiki,hello" --lang fra --oem 1 --psm 3 scanned-dir
https://github.com/rigwild/ocr-search
A second CLI, ocr-search-clean, is provided to remove the page images extracted from PDF files and the text files generated by OCR.
$ ocr-search-clean --help
🗑️ Find and remove content generated by ocr-search
Usage
$ ocr-search-clean [--pdf] [--txt] <input_files>
Options
--pdf Remove images that were generated by PDF files pages extraction (e.g."file.pdf-1.png")
--txt Remove text files that were generated by OCR (option "--save-ocr" in "ocr-search")
https://github.com/rigwild/ocr-search
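The help text above gives "file.pdf-1.png" as an example of a generated page image. As a rough sketch of how such files could be recognized by name, the hypothetical helper below (`isPdfPageImage` is not part of the package) tests for that shape. The exact pattern used by ocr-search-clean is not documented here, so the regular expression is an assumption based on that single example.

```typescript
/**
 * Hypothetical helper (not part of ocr-search): detect file names shaped like
 * "<name>.pdf-<page number>.png", as in the "file.pdf-1.png" example.
 * Assumption: the extension check is case-insensitive.
 */
function isPdfPageImage(fileName: string): boolean {
  return /\.pdf-\d+\.png$/i.test(fileName)
}

console.log(isPdfPageImage('file.pdf-1.png')) // true
console.log(isPdfPageImage('photo.png')) // false
```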
Use with provided runner
git clone https://github.com/rigwild/ocr-search.git
cd ocr-search
pnpm install
pnpm build
Put all your files/directories in the data directory. They can be in subfolders.
The progress will be printed to the console and saved in the progress.json file.
The list of files that match at least one of the provided words and their content will be saved to the matches.txt file.
node run.js
See run.js.
Use Programmatically
Install
pnpm i ocr-search
Directory scan
import path from 'path'
import { scanDir, TesseractConfig } from 'ocr-search'
// Options accepted by scanDir, shown here for reference (all fields are optional)
export type ScanOptions = {
words?: string[]
saveOcr?: boolean
shouldConsoleLog?: boolean
shouldConsoleLogMatches?: boolean
progressFile?: string
matchesLogFile?: string
ignoreExt?: Set<string>
pdfExtractFirst?: number
pdfExtractLast?: number
workerPoolSize?: number
tesseractConfig?: TesseractConfig
}
const scannedDir = path.resolve(__dirname, 'data')
const words = ['hello', 'match this', '<<<<<']
const tesseractConfig: TesseractConfig = { lang: 'fra', oem: 1, psm: 1 }
console.time('scan')
await scanDir(scannedDir, {
words,
shouldConsoleLog: true,
tesseractConfig
})
console.log('Scan finished!')
console.timeEnd('scan')
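The CLI takes `--words` and `--ignoreExt` as comma-separated strings, while `ScanOptions` expects `words` as a `string[]` and `ignoreExt` as a `Set<string>`. A minimal sketch of that conversion is shown below. `parseList` is a hypothetical helper, and trimming whitespace and dropping empty entries are assumptions rather than documented CLI behavior.

```typescript
/**
 * Hypothetical helper (not part of ocr-search): split a comma-separated list.
 * Assumption: surrounding whitespace is trimmed and empty entries are dropped.
 */
function parseList(value: string): string[] {
  return value
    .split(',')
    .map(item => item.trim())
    .filter(item => item.length > 0)
}

// Shapes matching the `words` and `ignoreExt` fields of ScanOptions
const parsedWords: string[] = parseList('wiki, hello')
const parsedIgnoreExt: Set<string> = new Set(parseList('.pdf,.webp'))

console.log(parsedWords) // [ 'wiki', 'hello' ]
console.log(parsedIgnoreExt.has('.pdf')) // true
```

The two values could then be passed as `words` and `ignoreExt` in the options given to `scanDir`.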
Perform OCR on a single file
import path from 'path'
import { ocr, TesseractConfig } from 'ocr-search'
const file = path.resolve(__dirname, '..', 'test', '_testFiles', 'sample.jpg')
const tesseractConfig: TesseractConfig = { lang: 'eng', oem: 1, psm: 1 }
const shouldCleanStr: boolean | undefined = true
const text = await ocr(file, tesseractConfig, shouldCleanStr)
console.log(text)
PDF to images conversion
Convert PDF pages to PNG. Files are generated on the file system, 1 file per PDF page.
import path from 'path'
import { pdfToImages } from 'ocr-search'
const filePdf = path.resolve(__dirname, '..', 'test', '_testFiles', 'sample.pdf')
const res = await pdfToImages(filePdf, 1, 3)
console.log(res)
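The CLI help describes the page range as 1-indexed, with the end falling back to the last page on overflow. The sketch below illustrates that rule in isolation. `resolvePageRange` is a hypothetical helper, not the package's implementation, and returning `null` for an empty range is an assumption.

```typescript
/**
 * Hypothetical helper (not part of ocr-search): resolve a 1-indexed,
 * inclusive page range against the number of pages in a PDF.
 * The end is clamped to the last page when it overflows.
 * Assumption: a range that starts past the last page yields null.
 */
function resolvePageRange(
  totalPages: number,
  first: number = 1,
  last: number = totalPages
): [number, number] | null {
  const start = Math.max(1, first)
  const end = Math.min(totalPages, last)
  return start <= end ? [start, end] : null
}

console.log(resolvePageRange(10, 3, 6)) // [ 3, 6 ]
console.log(resolvePageRange(4, 3, 6)) // [ 3, 4 ]
console.log(resolvePageRange(2, 3, 6)) // null
```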
License
GNU Affero General Public License v3.0