a11ycat-ocr
OCR PDF documents in Node.js 🐱
Dependencies
IMPORTANT: a11ycat-ocr expects the ImageMagick tools, as well as the tesseract binary, to be available in your PATH.
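A quick way to confirm both dependencies before building is to try starting each binary from Node.js. The following is a minimal sketch, not part of a11ycat-ocr: the helper names `isOnPath` and `findMissingTools` are hypothetical, and the binary names are assumptions (ImageMagick 6 ships `convert`, ImageMagick 7 ships `magick`; this README does not say which one the library calls).

```javascript
const { spawnSync } = require('child_process')

// Returns true if `command` could be started, i.e. it was found on the PATH.
// Only a spawn failure (for example ENOENT) counts as missing; the exit code is ignored.
function isOnPath(command, args = ['--version']) {
  const result = spawnSync(command, args, { stdio: 'ignore' })
  return result.error === undefined
}

// ASSUMPTION: these binary names are illustrative, not taken from a11ycat-ocr itself.
const requiredTools = [
  { name: 'ImageMagick', candidates: ['convert', 'magick'], args: ['-version'] },
  { name: 'Tesseract', candidates: ['tesseract'], args: ['--version'] }
]

// Returns the names of the tools for which no candidate binary was found.
function findMissingTools(tools = requiredTools) {
  return tools
    .filter((tool) => !tool.candidates.some((bin) => isOnPath(bin, tool.args)))
    .map((tool) => tool.name)
}

const missing = findMissingTools()
if (missing.length > 0) {
  console.warn(`Missing from PATH: ${missing.join(', ')}`)
} else {
  console.log('ImageMagick and Tesseract were found on the PATH')
}
```

Note that on Windows, `convert` may resolve to an unrelated system utility, so a positive result for ImageMagick there is only a hint.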
Quick Start
- Build the project from the repository
git clone https://github.com/devnoot/a11ycat-ocr.git a11ycat-ocr
cd a11ycat-ocr
npm install
npm run build
- Include the OCR class in your project
const { resolve } = require('path')
// Adjust this path to where dist/index lives relative to your script
const { A11yCat } = require('../../dist/index')

const ocr = new A11yCat.OCR()

async function main() {
  // Resolve the input PDF and the output directory relative to the current working directory
  const pdfPath = resolve(process.cwd(), 'test/data/pdfs/set1/Modeling High-Frequency Limit Order Book Dynamics with Support Vector Machines.pdf')
  const destinationDir = resolve(process.cwd(), 'tmp')

  console.log('Generating images from PDF')
  const generatedImages = await ocr.convertPdfToImages(pdfPath, destinationDir)

  console.log('Doing OCR on first page')
  await ocr.tess(generatedImages[0])
}

// Report failures instead of leaving the promise rejection unhandled
main().catch((error) => {
  console.error(error)
  process.exitCode = 1
})
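The Quick Start example only runs OCR on the first generated image. To process every page, the same two calls can be wrapped in a loop. This is a sketch under assumptions: `ocrAllPages` is a hypothetical helper, `convertPdfToImages` is assumed to resolve to an array of image paths (as the `generatedImages[0]` access suggests), and what `tess` resolves to is not documented here, so the results are collected as-is.

```javascript
// Sketch: OCR every page of a PDF, one page at a time.
// ASSUMPTION: `ocr` is an instance like the one created in the Quick Start,
// with `convertPdfToImages` and `tess` both returning promises.
async function ocrAllPages(ocr, pdfPath, destinationDir) {
  const images = await ocr.convertPdfToImages(pdfPath, destinationDir)
  const results = []
  for (const [index, image] of images.entries()) {
    console.log(`Doing OCR on page ${index + 1} of ${images.length}`)
    // Pages are processed sequentially, so only one OCR job runs at a time
    results.push(await ocr.tess(image))
  }
  return results
}
```

Inside an async function such as `main`, it could be called as `const results = await ocrAllPages(ocr, pdfPath, destinationDir)`. Passing `ocr` in as a parameter keeps the helper independent of where the built library lives.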
Tests
Tests are located in test/spec. Tests should use data from test/data/images and test/data/pdfs.
Because the test dataset includes some large PDFs, the test run can take a very long time, depending on the host computer.
npm test