a11ycat-ocr
OCR PDF documents in Node.js 🐱
Dependencies
Optional:
tesseract is only needed if you are testing out the tess method on the OCR class. That method is currently experimental and does not return any of the OCR'd data, but it is much faster than tesseract.js.
IMPORTANT: a11ycat-ocr expects the ImageMagick tools to be available in $PATH. If you are testing the tess method on the OCR class, then tesseract must also be in $PATH.
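To confirm the external tools are visible before running anything, a quick check along these lines may help. This is only a sketch: findOnPath is a hypothetical helper, not part of a11ycat-ocr, and the binary names it looks for (convert and magick for ImageMagick, tesseract for Tesseract) are assumptions, since this README does not say which executables the library invokes.

```javascript
const fs = require('fs')
const path = require('path')

// Hypothetical helper (not part of a11ycat-ocr): return the full path of
// `command` if an executable file with that name exists in one of the
// directories listed in `pathVar`, otherwise null.
function findOnPath(command, pathVar = process.env.PATH || '') {
  // On Windows, executables carry an extension such as .EXE
  const extensions = process.platform === 'win32'
    ? (process.env.PATHEXT || '.EXE;.CMD;.BAT').split(';')
    : ['']
  for (const dir of pathVar.split(path.delimiter)) {
    if (!dir) continue
    for (const ext of extensions) {
      const candidate = path.join(dir, command + ext)
      try {
        fs.accessSync(candidate, fs.constants.X_OK)
        if (fs.statSync(candidate).isFile()) return candidate
      } catch (err) {
        // Missing or not executable in this directory; keep looking
      }
    }
  }
  return null
}

// Assumed binary names; adjust to match your ImageMagick and Tesseract install
for (const tool of ['convert', 'magick', 'tesseract']) {
  const found = findOnPath(tool)
  console.log(found ? `${tool}: ${found}` : `${tool}: not found in $PATH`)
}
```

Depending on the ImageMagick version installed, only one of convert or magick may be present, so a single "not found" line for ImageMagick is not necessarily a problem.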
Quick Start
- Build the project from the repository
git clone https://github.com/devnoot/a11ycat-ocr.git a11ycat-ocr
cd a11ycat-ocr
npm install
npm run build
- Include the OCR class in your project
const { resolve } = require('path')
const { A11yCat } = require('../../dist/index')

const ocr = new A11yCat.OCR()

async function main() {
  // Paths are resolved relative to the current working directory
  const pdfPath = resolve(process.cwd(), 'test/data/pdfs/set1/Modeling High-Frequency Limit Order Book Dynamics with Support Vector Machines.pdf')
  const destinationDir = resolve(process.cwd(), 'tmp')

  console.log('Generating images from PDF')
  const generatedImages = await ocr.convertPdfToImages(pdfPath, destinationDir)

  console.log('Doing OCR on first page')
  // tess is experimental and does not return the OCR'd data
  await ocr.tess(generatedImages[0])
}

// Report failures and exit non-zero instead of leaving an unhandled rejection
main().catch((error) => {
  console.error(error)
  process.exitCode = 1
})
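The example above only processes the first page. To run OCR over every generated image, one option is to process the pages one at a time so that several OCR jobs are not started at once. The sketch below assumes convertPdfToImages resolves to an array of image paths (the example indexes it with [0]); forEachSequential is a hypothetical helper, not part of a11ycat-ocr.

```javascript
// Hypothetical helper (not part of a11ycat-ocr): call an async handler on
// each item in order, waiting for one call to finish before starting the next.
async function forEachSequential(items, handler) {
  for (let i = 0; i < items.length; i++) {
    await handler(items[i], i)
  }
}

// Possible usage inside main() from the Quick Start example:
//
//   await forEachSequential(generatedImages, async (image, index) => {
//     console.log(`Doing OCR on page ${index + 1}`)
//     await ocr.tess(image)
//   })
```

Sequential processing trades speed for predictable memory and CPU use, which may matter for the larger PDFs.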
Tests
Tests are located in test/spec. Tests should use data from test/data/images and test/data/pdfs.
Because there are some large PDFs in the test dataset, the test suite can take a very long time to run, depending on the host computer.
npm test