
tesseract.js - npm Package Compare versions

Comparing version 5.1.0 to 5.1.1


docs/api.md

@@ -339,3 +339,3 @@ # API

-`scheduler.getNumWorkers()` returns the length of job queue.
+`scheduler.getQueueLen()` returns the length of job queue.

@@ -345,3 +345,3 @@ <a name="scheduler-get-num-workers"></a>

-Scheduler.getNumWorkers() returns number of workers added into the scheduler
+`Scheduler.getNumWorkers()` returns number of workers added into the scheduler

@@ -348,0 +348,0 @@ <a name="scheduler-terminate"></a>
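The corrected docs distinguish two getters: `getQueueLen()` for the job queue and `getNumWorkers()` for the workers added to the scheduler. A minimal sketch of reading both is shown here; `schedulerStatus` and the `hasBacklog` field are hypothetical names introduced for illustration, not part of Tesseract.js.

```javascript
// Summarize a scheduler's load using the two getters named in docs/api.md.
// `scheduler` is assumed to be a Tesseract.js scheduler, or any object
// exposing the same two methods.
function schedulerStatus(scheduler) {
  const workers = scheduler.getNumWorkers(); // workers added to the scheduler
  const queued = scheduler.getQueueLen();    // length of the job queue
  // hasBacklog is an illustrative convenience flag, not a library field.
  return { workers, queued, hasBacklog: queued > 0 };
}
```

With a real scheduler this could be called periodically to log load before deciding whether to add workers.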

@@ -27,3 +27,9 @@ FAQ

## Are PDF files supported?
-Tesseract.js does not support .pdf directly—a separate library must be used to convert the .pdf files to images before Tesseract can recognize them. If you are an end user and want to use Tesseract.js to OCR a .pdf file, consider using [scribeocr.com](https://scribeocr.com/), a project that uses Tesseract.js and supports .pdf files. If you are a developer who wants to use Tesseract.js with .pdf files, you can use either of the libraries below to convert from .pdf to images.
+Tesseract.js does not support PDF files. If you need to run OCR on PDF files, possible options are below.
+### Use Scribe.js
+[Scribe.js](https://github.com/scribeocr/scribe.js) is a library that builds on Tesseract.js and includes additional features, including native PDF support. Scribe.js supports running OCR on PDF files. Additionally, Scribe.js supports extracting text directly from text-native PDF files, which is significantly faster and more accurate compared to running OCR.
+### Render PDFs to Images
+The only way to recognize PDF files using Tesseract.js is to use a third-party library to render the `.pdf` file to a series of `.png` images, and then recognize those images using Tesseract.js. Libraries to consider are listed below.
1. [PDF.js](https://github.com/mozilla/pdf.js/) (Apache-2.0 license)

@@ -30,0 +36,0 @@ 2. [muPDF](https://github.com/ArtifexSoftware/mupdf) (AGPL-3.0 license)
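The render-then-recognize approach described in the FAQ can be sketched as a loop over page images. This is only an illustration under stated assumptions: `recognizeRenderedPdf` and `pageImages` are hypothetical names, and the rendering step itself (with PDF.js or MuPDF) is left to the caller because it depends on the chosen library. The only Tesseract.js call used is `worker.recognize(image)`, whose result exposes the text as `data.text`.

```javascript
// Recognize every page of a PDF that has ALREADY been rendered to images.
// `pageImages` is a hypothetical iterable (sync or async) of page images,
// e.g. PNG buffers produced by a PDF rendering library chosen by the caller.
// `worker` is assumed to be a Tesseract.js worker.
async function recognizeRenderedPdf(pageImages, worker) {
  const pageTexts = [];
  for await (const image of pageImages) {
    // recognize() resolves to an object whose data.text holds the page text.
    const { data } = await worker.recognize(image);
    pageTexts.push(data.text);
  }
  return pageTexts; // one string per page, in page order
}
```

Pages are processed one at a time here for simplicity; a scheduler with several workers could process them in parallel instead.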

@@ -52,1 +52,7 @@ # Overview

When working with schedulers, note that workers added to the same scheduler should all be homogenous—they should have the same language be configured with the same parameters. Schedulers assign jobs to workers in a non-deterministic manner, so if the workers are not identical then recognition results will depend on which worker the job is assigned to.
+# Reusing Workers in Node.js Server Code
+While workers and schedulers are reusable, and we recommend reusing them between jobs, using the same worker/scheduler for a week straight within Node.js server code will cause problems. Therefore, when using workers/schedulers within long-running Node.js server code, workers/schedulers should be killed and re-created every so often. For example, a scheduler could be terminated and re-created after every 500 jobs.
+There are a couple reasons why periodically “resetting” workers/schedulers within server code is a good practice. First, due to general WebAssembly limitations, the memory allocated to workers can only expand over time. Therefore, a single large image will permanently increase the memory footprint of a worker. Second, workers “learn” over time by adding additional words they encounter in jobs to their internal dictionaries. While this behavior is useful within the context of a single document or group of documents, it is not necessarily desirable if recognizing hundreds of unrelated documents. If a single scheduler runs thousands of jobs over an entire week, the internal dictionary will eventually become bloated and include typos.
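One way to follow the "terminate and re-create after every 500 jobs" suggestion is a small wrapper that retires a scheduler once it has been assigned a fixed number of jobs. The sketch below is an assumption-laden illustration, not code from Tesseract.js: `createRecyclingScheduler` and `createFreshScheduler` are hypothetical names, and the caller-supplied factory is expected to build a scheduler with identically configured workers. Only `scheduler.addJob(...)` and `scheduler.terminate()` are taken from the library's API.

```javascript
// Wrap a scheduler factory so each scheduler handles at most `maxJobs` jobs.
// `createFreshScheduler` is a hypothetical async factory that returns a new
// scheduler with homogeneous workers already added.
function createRecyclingScheduler(createFreshScheduler, maxJobs = 500) {
  let generation = null; // the scheduler currently accepting jobs

  async function addJob(...args) {
    if (generation === null) {
      generation = { ready: createFreshScheduler(), assigned: 0, inflight: 0, retired: false };
    }
    const gen = generation;
    gen.assigned += 1;
    gen.inflight += 1;
    if (gen.assigned >= maxJobs) {
      gen.retired = true;
      generation = null; // the next job starts a fresh scheduler
    }
    try {
      const scheduler = await gen.ready;
      return await scheduler.addJob(...args);
    } finally {
      gen.inflight -= 1;
      // Terminate a retired scheduler only after its last job has settled.
      if (gen.retired && gen.inflight === 0) {
        await (await gen.ready).terminate();
      }
    }
  }

  // Shut down the active scheduler, e.g. when the server stops.
  async function terminate() {
    const gen = generation;
    generation = null;
    if (gen === null) return;
    gen.retired = true;
    if (gen.inflight === 0) await (await gen.ready).terminate();
  }

  return { addJob, terminate };
}
```

A caller would use `recycler.addJob('recognize', image)` exactly as with a plain scheduler. Waiting for in-flight jobs before terminating is a design choice made here so that a retiring scheduler does not drop work.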
{
"name": "tesseract.js",
-"version": "5.1.0",
+"version": "5.1.1",
"description": "Pure Javascript Multilingual OCR",

@@ -72,3 +72,3 @@ "main": "src/index.js",

"regenerator-runtime": "^0.13.3",
-"tesseract.js-core": "^5.1.0",
+"tesseract.js-core": "^5.1.1",
"wasm-feature-detect": "^1.2.11",

@@ -75,0 +75,0 @@ "zlibjs": "^0.3.1"

@@ -132,3 +132,3 @@ const resolvePaths = require('./utils/resolvePaths');

gzip: options.gzip,
-lstmOnly: [OEM.LSTM_ONLY, OEM.TESSERACT_LSTM_COMBINED].includes(currentOem)
+lstmOnly: [OEM.DEFAULT, OEM.LSTM_ONLY].includes(currentOem)
&& !options.legacyLang,

@@ -135,0 +135,0 @@ },
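The effect of the `lstmOnly` change can be seen by evaluating the old and new conditions side by side. In this sketch the numeric `OEM` values are an assumption mirroring Tesseract's engine-mode numbering (they are not shown in the diff), and `lstmOnlyBefore` / `lstmOnlyAfter` are hypothetical helper names wrapping the two expressions from the hunk.

```javascript
// Assumed engine-mode values; only the identifiers appear in the diff.
const OEM = {
  TESSERACT_ONLY: 0,
  LSTM_ONLY: 1,
  TESSERACT_LSTM_COMBINED: 2,
  DEFAULT: 3,
};

// Condition as written in 5.1.0.
const lstmOnlyBefore = (currentOem, legacyLang) =>
  [OEM.LSTM_ONLY, OEM.TESSERACT_LSTM_COMBINED].includes(currentOem) && !legacyLang;

// Condition as written in 5.1.1.
const lstmOnlyAfter = (currentOem, legacyLang) =>
  [OEM.DEFAULT, OEM.LSTM_ONLY].includes(currentOem) && !legacyLang;

// The two versions disagree only for DEFAULT and TESSERACT_LSTM_COMBINED.
for (const [name, value] of Object.entries(OEM)) {
  console.log(name, lstmOnlyBefore(value, false), lstmOnlyAfter(value, false));
}
```

Under these assumptions, the combined mode no longer counts as LSTM-only and the default mode now does, while `legacyLang` still forces the flag to `false` in both versions.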

Sorry, the diff of this file is too big to display

Sorry, the diff of this file is not supported yet
