a11ycat-ocr
OCR PDF documents in Node.js 🐱
Dependencies
Optional:
tesseract is only needed if you are trying out the tess method on the OCR class. The tess method is much faster than the recognize method on the OCR class, which uses tesseract.js, but it yields less information.
IMPORTANT: a11ycat-ocr expects the ImageMagick tools to be available in $PATH. If you are using the tess method on the OCR class, then tesseract must also be in $PATH.
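The $PATH requirement above can be checked before running any OCR. The following is a minimal sketch, not part of a11ycat-ocr: the helper names `findOnPath` and `checkDependencies` are hypothetical, and it assumes the ImageMagick tool is called `convert` or `magick` (the name varies by ImageMagick version and platform).

```javascript
const fs = require('fs')
const path = require('path')

// Hypothetical helper: return the full path of the first executable file
// called `name` found in the given PATH string, or null if there is none.
function findOnPath(name, envPath = process.env.PATH || '') {
  // On Windows, executables carry one of the extensions listed in PATHEXT
  const exts = process.platform === 'win32'
    ? (process.env.PATHEXT || '.EXE;.CMD;.BAT').split(';')
    : ['']
  for (const dir of envPath.split(path.delimiter).filter(Boolean)) {
    for (const ext of exts) {
      const candidate = path.join(dir, name + ext)
      try {
        fs.accessSync(candidate, fs.constants.X_OK)
        if (fs.statSync(candidate).isFile()) return candidate
      } catch (err) {
        // Not present or not executable in this directory; keep looking
      }
    }
  }
  return null
}

// Hypothetical helper: list the dependencies that could not be found.
// Assumption: ImageMagick is reachable as `convert` or `magick`.
function checkDependencies(needTesseract = false) {
  const missing = []
  if (!findOnPath('convert') && !findOnPath('magick')) missing.push('ImageMagick')
  if (needTesseract && !findOnPath('tesseract')) missing.push('tesseract')
  return missing
}

const missing = checkDependencies(true)
if (missing.length > 0) {
  console.warn('Not found in $PATH: ' + missing.join(', '))
}
```

Pass `true` to `checkDependencies` only if you plan to call the tess method; otherwise the tesseract check is skipped.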
Quick Start
- Build the project from the repository
git clone https://github.com/devnoot/a11ycat-ocr.git a11ycat-ocr
cd a11ycat-ocr
npm install
npm run build
- Include the OCR class in your project
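const { A11yCat } = require('../../dist/index')
const { resolve } = require('path')

const ocr = new A11yCat.OCR()

async function main() {
  const pdfPath = '/path/to/my.pdf'
  // Directory for the images generated from the PDF
  const destinationDir = resolve(process.cwd(), 'tmp')
  const generatedImages = await ocr.convertPdfToImages(pdfPath, destinationDir)
  // OCR the first generated image (requires tesseract in $PATH)
  const textFile = await ocr.tess(generatedImages[0])
  console.log(textFile)
}

// Report failures instead of leaving an unhandled promise rejection
main().catch((error) => {
  console.error(error)
  process.exitCode = 1
})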
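The Quick Start example runs tess on the first generated image only. To process every image, a loop along the following lines may work. This is a sketch under assumptions: `ocrAllPages` is a hypothetical helper, not part of a11ycat-ocr, and it relies only on the two methods shown above (`convertPdfToImages` returning an array of images, and `tess` accepting one of them). What `tess` returns is passed through unchanged.

```javascript
// Hypothetical helper: convert a PDF to images, then run `tess` on each
// image in order. `ocr` is any object exposing convertPdfToImages and tess,
// such as an A11yCat.OCR instance.
async function ocrAllPages(ocr, pdfPath, destinationDir) {
  const images = await ocr.convertPdfToImages(pdfPath, destinationDir)
  const results = []
  // Process images one at a time so that only a single OCR job runs at once
  for (const image of images) {
    const output = await ocr.tess(image)
    results.push({ image, output })
  }
  return results
}
```

The images are processed sequentially rather than with `Promise.all`, on the assumption that each `tess` call is expensive and that running many at once on a large PDF could overload the host.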
Tests
Tests are located in test/spec. Tests should use data from test/data/images and test/data/pdfs.
Because the test dataset contains some large PDFs, the test suite can take a very long time to run, depending on the host computer.
npm test