What is tesseract.js-core?
tesseract.js-core is a JavaScript library that provides core functionalities for optical character recognition (OCR) using the Tesseract OCR engine. It allows developers to extract text from images directly in the browser or in a Node.js environment.
What are tesseract.js-core's main functionalities?
Basic OCR
This code demonstrates how to perform basic OCR using tesseract.js-core. It initializes a worker, loads the necessary language data, and processes an image to extract text.
const TesseractCore = require('tesseract.js-core');
const { createWorker } = require('tesseract.js');

// Create a worker, pointing it at tesseract.js-core
const worker = createWorker({
  corePath: TesseractCore
});

(async () => {
  await worker.load();               // load the OCR core
  await worker.loadLanguage('eng');  // load the English language data
  await worker.initialize('eng');    // initialize the engine for English
  const { data: { text } } = await worker.recognize('path/to/image.png');
  console.log(text);
  await worker.terminate();          // shut the worker down when finished
})();
OCR with Progress Updates
This code sample shows how to perform OCR with progress updates. The logger function is used to log progress messages to the console.
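const TesseractCore = require('tesseract.js-core');
const { createWorker } = require('tesseract.js');

// The logger callback is invoked with a message object for each progress update
const worker = createWorker({
  corePath: TesseractCore,
  logger: m => console.log(m)
});

(async () => {
  await worker.load();               // load the OCR core
  await worker.loadLanguage('eng');  // load the English language data
  await worker.initialize('eng');    // initialize the engine for English
  const { data: { text } } = await worker.recognize('path/to/image.png');
  console.log(text);
  await worker.terminate();          // shut the worker down when finished
})();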
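Logging the raw message object can be noisy. The sketch below shows one way to turn a progress message into a single readable line. It assumes each message carries a `status` string and a `progress` number between 0 and 1, which is the shape tesseract.js commonly reports but should be checked against the installed version. The name `formatProgress` is hypothetical and not part of either library.

```javascript
// Hypothetical helper: turn a logger message into a one-line progress string.
// Assumes messages look like { status: 'recognizing text', progress: 0.42 }.
function formatProgress(message) {
  const status = typeof message.status === 'string' ? message.status : 'unknown';
  const progress = Number.isFinite(message.progress) ? message.progress : 0;
  // Clamp to [0, 1] so an out-of-range value never prints as, say, 140%.
  const clamped = Math.min(1, Math.max(0, progress));
  return `${status}: ${Math.round(clamped * 100)}%`;
}

console.log(formatProgress({ status: 'recognizing text', progress: 0.42 }));
```

It could then be passed to the worker as `logger: m => console.log(formatProgress(m))`.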
OCR with Multiple Languages
This code demonstrates how to perform OCR on an image using multiple languages (English and Spanish in this case).
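const TesseractCore = require('tesseract.js-core');
const { createWorker } = require('tesseract.js');

// Create a worker, pointing it at tesseract.js-core
const worker = createWorker({
  corePath: TesseractCore
});

(async () => {
  await worker.load();                   // load the OCR core
  await worker.loadLanguage('eng+spa');  // load English and Spanish language data
  await worker.initialize('eng+spa');    // initialize the engine for both languages
  const { data: { text } } = await worker.recognize('path/to/image.png');
  console.log(text);
  await worker.terminate();              // shut the worker down when finished
})();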
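The sample above passes the same `'eng+spa'` string to both `loadLanguage` and `initialize`. When the language list comes from configuration or user input, building that string in one place avoids the two calls drifting apart. The following is a minimal sketch; `buildLangString` is a hypothetical helper name, not part of tesseract.js or tesseract.js-core.

```javascript
// Hypothetical helper: build a '+'-joined language string such as 'eng+spa'.
function buildLangString(codes) {
  // Trim whitespace and drop empty entries
  const cleaned = codes.map(code => code.trim()).filter(code => code.length > 0);
  if (cleaned.length === 0) {
    throw new Error('at least one language code is required');
  }
  // Remove duplicates while keeping first-seen order
  return [...new Set(cleaned)].join('+');
}

const langs = buildLangString(['eng', 'spa']);
console.log(langs);
```

The resulting value can be used for both calls, for example `worker.loadLanguage(langs)` followed by `worker.initialize(langs)`.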
Other packages similar to tesseract.js-core
ocrad.js
ocrad.js is a JavaScript port of the OCRAD OCR engine. It is a pure JavaScript library that can be used in the browser or in Node.js. Compared to tesseract.js-core, ocrad.js is simpler and may be easier to integrate for basic OCR tasks, but it may not be as powerful or accurate as Tesseract.
node-tesseract-ocr
node-tesseract-ocr is a Node.js wrapper for the Tesseract OCR engine. It provides a simple interface for performing OCR on images. While it offers similar functionalities to tesseract.js-core, it is specifically designed for Node.js and may not be suitable for browser environments.
tesseract.js-core

Core part of tesseract.js, which compiles the original Tesseract from C++ to JavaScript and WebAssembly.
Structure
- Build scripts are in build-scripts folder
- Javascript/wrapper files are in javascript folder
- All dependencies (including Tesseract) are in third_party folder
  - All dependencies are unmodified except for Tesseract, which uses a forked repo
  - The Tesseract repo has the following changes:
    - Modified CMakeLists.txt to build with emscripten
    - Modified ltrresultiterator.h and ltrresultiterator.cpp to add WordChoiceIterator class
    - Added src/arch_see folder, which is used instead of src/arch for the simd-enabled build
      - This hard-codes the use of the SSE function
    - Commented out "Empty page!!" message in src/textord/colfind.cpp to prevent this from printing to console
    - Modified src/ccmain/thresholder.cpp, src/ccmain/thresholder.h, src/api/baseapi.cpp, and include/tesseract/baseapi.h to add option for rotating images using exif orientation tag
    - Added calls to EM_ASM_ARGS to src/ccmain/control.cpp for progress logging (and added <emscripten.h> header)
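For readers unfamiliar with the EXIF orientation tag mentioned in the list above, the sketch below illustrates how its non-mirrored values map to a rotation. It follows the standard EXIF Orientation values (1, 3, 6, 8) and is not taken from the forked Tesseract sources; the function and constant names are hypothetical, and mirrored orientations (2, 4, 5, 7) are deliberately left unhandled.

```javascript
// Illustrative only, not the fork's code: clockwise rotation in degrees needed
// to display an image upright for the non-mirrored EXIF Orientation values.
const EXIF_ROTATION_CW = { 1: 0, 3: 180, 6: 90, 8: 270 };

function rotationForExifOrientation(orientation) {
  // hasOwnProperty guards against inherited keys such as 'toString'
  return Object.prototype.hasOwnProperty.call(EXIF_ROTATION_CW, orientation)
    ? EXIF_ROTATION_CW[orientation]
    : null; // mirrored or unknown values are not handled in this sketch
}

console.log(rotationForExifOrientation(6));
```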
Running Minimal Examples
To run the browser examples, launch a web server in the root of the repo (i.e. run http-server). Then navigate to the pages in examples/web/minimal/ in your browser.
To run the node examples, navigate to examples/node/minimal/ and then run e.g. node index.wasm.js.
The "benchmark" examples behave similarly, except that they take longer to run and report runtime instead of recognition text. All other examples are experimental and should not be expected to run.
Contribution
Dependencies are managed as git submodules, so remember to pass --recursive when cloning the repository:
$ git clone --recursive https://github.com/naptha/tesseract.js-core
To build tesseract-core.js yourself, install Docker and run:
$ bash build-with-docker.sh
The generated files will be stored in the root of the repository.