tesseract.js-core
tesseract.js-core is a JavaScript library that provides core functionalities for optical character recognition (OCR) using the Tesseract OCR engine. It allows developers to extract text from images directly in the browser or in a Node.js environment.
Basic OCR
This code demonstrates how to perform basic OCR using tesseract.js-core. It initializes a worker, loads the necessary language data, and processes an image to extract text.
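const { createWorker } = require('tesseract.js');

(async () => {
  // corePath expects a path or URL string to the tesseract.js-core files, not the
  // module object. The relative path below is an assumed layout; adjust as needed.
  // The await is harmless on versions where createWorker returns the worker directly.
  const worker = await createWorker({
    corePath: 'node_modules/tesseract.js-core',
  });

  // load/loadLanguage/initialize belong to the older tesseract.js worker API;
  // newer releases take the language in createWorker instead.
  await worker.load();
  await worker.loadLanguage('eng');
  await worker.initialize('eng');

  // Recognize the image and print the extracted text.
  const { data: { text } } = await worker.recognize('path/to/image.png');
  console.log(text);

  // Release the worker's resources.
  await worker.terminate();
})();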
OCR with Progress Updates
This code sample shows how to perform OCR with progress updates. The logger function is used to log progress messages to the console.
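const { createWorker } = require('tesseract.js');

(async () => {
  // corePath expects a path or URL string (the value below is an assumed layout),
  // not the module object. logger receives each progress message as it arrives.
  const worker = await createWorker({
    corePath: 'node_modules/tesseract.js-core',
    logger: m => console.log(m),
  });

  // Older tesseract.js worker API; newer releases take the language in createWorker.
  await worker.load();
  await worker.loadLanguage('eng');
  await worker.initialize('eng');

  const { data: { text } } = await worker.recognize('path/to/image.png');
  console.log(text);

  await worker.terminate();
})();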
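The raw messages passed to the logger can be noisy. As a minimal sketch, a small formatter can reduce each one to a single line. The helper name formatProgress is hypothetical, and the sketch assumes each message carries a status string and a progress fraction between 0 and 1; check the message shape in the tesseract.js version you use.

```javascript
// Hypothetical helper: turns one logger message into a readable line.
// Assumption: messages look like { status: string, progress: number in [0, 1] }.
function formatProgress(m) {
  const status = typeof m.status === 'string' ? m.status : 'unknown';
  const fraction = Number.isFinite(m.progress) ? m.progress : 0;
  // Clamp so that out-of-range values never print as -5% or 170%.
  const clamped = Math.min(1, Math.max(0, fraction));
  return `${status}: ${Math.round(clamped * 100)}%`;
}

console.log(formatProgress({ status: 'recognizing text', progress: 0.5 }));
// prints "recognizing text: 50%"
```

It would be wired in through the logger option, for example logger: m => console.log(formatProgress(m)).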
OCR with Multiple Languages
This code demonstrates how to perform OCR on an image using multiple languages (English and Spanish in this case).
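const { createWorker } = require('tesseract.js');

(async () => {
  // corePath expects a path or URL string (the value below is an assumed layout),
  // not the module object.
  const worker = await createWorker({
    corePath: 'node_modules/tesseract.js-core',
  });

  // Older tesseract.js worker API; newer releases take the language in createWorker.
  // Multiple languages are joined with '+'.
  await worker.load();
  await worker.loadLanguage('eng+spa');
  await worker.initialize('eng+spa');

  const { data: { text } } = await worker.recognize('path/to/image.png');
  console.log(text);

  await worker.terminate();
})();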
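When the language list comes from configuration rather than a literal, the '+'-separated string can be built from an array. This is a minimal sketch; joinLanguages is a hypothetical helper, not part of tesseract.js or tesseract.js-core.

```javascript
// Hypothetical helper: builds the '+'-separated language string used above
// from a list of language codes, dropping blanks and duplicates.
function joinLanguages(codes) {
  if (!Array.isArray(codes)) {
    throw new TypeError('codes must be an array of language codes');
  }
  const cleaned = codes.map(c => String(c).trim()).filter(c => c.length > 0);
  const unique = [...new Set(cleaned)]; // Set preserves first-seen order
  if (unique.length === 0) {
    throw new Error('at least one language code is required');
  }
  return unique.join('+');
}

console.log(joinLanguages(['eng', 'spa'])); // prints "eng+spa"
```

The result can be passed wherever the examples above use the literal 'eng+spa'.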
ocrad.js is a JavaScript port of the OCRAD OCR engine. It is a pure JavaScript library that can be used in the browser or in Node.js. Compared to tesseract.js-core, ocrad.js is simpler and may be easier to integrate for basic OCR tasks, but it may not be as powerful or accurate as Tesseract.
node-tesseract-ocr is a Node.js wrapper for the Tesseract OCR engine. It provides a simple interface for performing OCR on images. While it offers similar functionalities to tesseract.js-core, it is specifically designed for Node.js and may not be suitable for browser environments.
The core part of tesseract.js, which compiles the original Tesseract from C++ to JavaScript and WebAssembly.
To build tesseract-core.js yourself, install Docker and run:
bash build-with-docker.sh
The generated files are stored in the repository root. Compilation sometimes fails because of race conditions (some dependencies do not appear to compile properly in parallel). Re-running the build generally resolves this.
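Since a failed build is usually fixed by running it again, the retry can be automated. The following is a minimal sketch, not part of the repository: the helper names and the default of three attempts are assumptions, and it treats any non-zero exit status as a failure worth retrying.

```javascript
const { spawnSync } = require('child_process');

// Calls `attempt` until it returns exit status 0 or maxAttempts is reached.
// Returns the number of attempts used; throws if every attempt fails.
// maxAttempts = 3 is an arbitrary default chosen for this sketch.
function runWithRetries(attempt, maxAttempts = 3) {
  for (let i = 1; i <= maxAttempts; i++) {
    const status = attempt(i);
    if (status === 0) return i;
    console.error(`attempt ${i} failed with status ${status}`);
  }
  throw new Error(`build failed after ${maxAttempts} attempts`);
}

// One build attempt: runs the build script and reports its exit status.
// status is null if the process could not start or was killed by a signal,
// which this sketch also counts as a failure.
function buildOnce() {
  const result = spawnSync('bash', ['build-with-docker.sh'], { stdio: 'inherit' });
  return result.status;
}

// Usage, from the repository root:
//   runWithRetries(buildOnce);
```

Note that this retries on any failure, so a genuine build error is only reported after the last attempt.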
Repository folders:
- build-scripts
- javascript
- third_party
Changes made to the Tesseract source:
- Modified CMakeLists.txt to build with Emscripten
- Modified ltrresultiterator.h and ltrresultiterator.cpp to add the WordChoiceIterator class
- Added the src/arch_sse folder, which is used instead of src/arch for the SIMD-enabled build
- Edited src/textord/colfind.cpp to prevent it from printing to the console
- Edited src/ccmain/thresholder.cpp, src/ccmain/thresholder.h, src/api/baseapi.cpp, and include/tesseract/baseapi.h to add exif and angle arguments for rotating images
- Changed FindLines from "protected" to "public" in baseapi.h to expose it to JavaScript
- Added a GetGradient function to baseapi.h and baseapi.cpp for reporting page angle
  - This also required editing src/ccmain/tesseractclass.h, src/ccmain/pagesegmain.cpp, src/textord/textord.cpp, and src/textord/textord.h
- Added a WriteImage function to baseapi.h and baseapi.cpp for saving images (original, grey, and binary)
- Added SaveParameters and RestoreParameters functions to baseapi.h and baseapi.cpp for saving and restoring parameters
- Added EM_ASM_ARGS to src/ccmain/control.cpp for progress logging (and added the <emscripten.h> header)
- Edited the tprintf function in src/ccutil/tprintf.cpp to force flushing
- Added a version of SetImage to src/api/baseapi.cpp and include/tesseract/baseapi.h that reads the image from the filesystem
- Edited ParamUtils::PrintParams in src/ccutil/params.cpp to remove description text (resolves a bug)
- Edited src/ccmain/tessedit.cpp to save the error log to a separate file (/debugDev.txt)
- Added src/api/jsonrenderer.cpp, and modified CMakeLists.txt, include/tesseract/baseapi.h, and include/tesseract/renderer.h
To run the browser examples, launch a web server in the root of the repo (e.g. run http-server). Then navigate to the pages in examples/web/minimal/ in your browser.
To run the node examples, navigate to examples/node/minimal/ and then run e.g. node index.wasm.js [input_file].
The "benchmark" examples behave similarly, except that they take longer to run and report runtime instead of recognition text. All other examples are experimental and should not be expected to run.
Dependencies are managed with git submodules, so remember to pass --recursive when cloning the repository:
git clone --recursive https://github.com/naptha/tesseract.js-core
FAQs
Tesseract C++ API in Pure Javascript
We found that tesseract.js-core demonstrated a healthy version release cadence and project activity because the last version was released less than a year ago. It has 4 open source maintainers collaborating on the project.