tesseract.js
Comparing version 5.1.0 to 5.1.1
@@ -339,3 +339,3 @@ # API
`scheduler.getNumWorkers()` returns the length of job queue.
`scheduler.getQueueLen()` returns the length of job queue.
@@ -345,3 +345,3 @@ <a name="scheduler-get-num-workers"></a>
Scheduler.getNumWorkers() returns number of workers added into the scheduler
`Scheduler.getNumWorkers()` returns number of workers added into the scheduler
@@ -348,0 +348,0 @@ <a name="scheduler-terminate"></a>
@@ -27,3 +27,9 @@ FAQ
## Are PDF files supported?
Tesseract.js does not support .pdf directly; a separate library must be used to convert the .pdf files to images before Tesseract can recognize them. If you are an end user and want to use Tesseract.js to OCR a .pdf file, consider using [scribeocr.com](https://scribeocr.com/), a project that uses Tesseract.js and supports .pdf files. If you are a developer who wants to use Tesseract.js with .pdf files, you can use either of the libraries below to convert from .pdf to images.
Tesseract.js does not support PDF files. If you need to run OCR on PDF files, possible options are below.
### Use Scribe.js
[Scribe.js](https://github.com/scribeocr/scribe.js) is a library that builds on Tesseract.js and includes additional features, including native PDF support. Scribe.js supports running OCR on PDF files. Additionally, Scribe.js supports extracting text directly from text-native PDF files, which is significantly faster and more accurate compared to running OCR.
### Render PDFs to Images
The only way to recognize PDF files using Tesseract.js is to use a third-party library to render the `.pdf` file to a series of `.png` images, and then recognize those images using Tesseract.js. Libraries to consider are listed below.
1. [PDF.js](https://github.com/mozilla/pdf.js/) (Apache-2.0 license)
@@ -30,0 +36,0 @@ 2. [muPDF](https://github.com/ArtifexSoftware/mupdf) (AGPL-3.0 license)
@@ -52,1 +52,7 @@ # Overview
When working with schedulers, note that workers added to the same scheduler should all be homogeneous: they should have the same language and be configured with the same parameters. Schedulers assign jobs to workers in a non-deterministic manner, so if the workers are not identical, recognition results will depend on which worker the job is assigned to.
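One way to keep workers identical is to create them all in a single loop from the same language and parameter set. The sketch below is illustrative only: `createHomogeneousScheduler` is a hypothetical helper, and it assumes the object passed as `tesseract` exposes `createScheduler` and `createWorker` the way the tesseract.js v5 module does.

```javascript
// Hypothetical helper (not part of tesseract.js): builds a scheduler whose
// workers all share one language and one parameter set.
// `tesseract` is assumed to expose createScheduler() and createWorker(lang).
async function createHomogeneousScheduler(tesseract, numWorkers, lang, params = {}) {
  const scheduler = tesseract.createScheduler();
  for (let i = 0; i < numWorkers; i += 1) {
    // Every worker is created with the same language...
    const worker = await tesseract.createWorker(lang);
    // ...and receives the same parameters before it joins the scheduler.
    await worker.setParameters(params);
    scheduler.addWorker(worker);
  }
  return scheduler;
}

// Possible usage, assuming tesseract.js is installed:
// const scheduler = await createHomogeneousScheduler(
//   require('tesseract.js'), 4, 'eng', { tessedit_char_whitelist: '0123456789' });
```

Because every worker comes from the same code path, a job produces the same result regardless of which worker picks it up.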
# Reusing Workers in Node.js Server Code
While workers and schedulers are reusable, and we recommend reusing them between jobs, using the same worker/scheduler for a week straight within Node.js server code will cause problems. Therefore, when using workers/schedulers within long-running Node.js server code, workers/schedulers should be killed and re-created every so often. For example, a scheduler could be terminated and re-created after every 500 jobs.
There are a couple of reasons why periodically “resetting” workers/schedulers within server code is a good practice. First, due to general WebAssembly limitations, the memory allocated to workers can only expand over time. Therefore, a single large image will permanently increase the memory footprint of a worker. Second, workers “learn” over time by adding additional words they encounter in jobs to their internal dictionaries. While this behavior is useful within the context of a single document or group of documents, it is not necessarily desirable if recognizing hundreds of unrelated documents. If a single scheduler runs thousands of jobs over an entire week, the internal dictionary will eventually become bloated and include typos.
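The periodic reset described above could be sketched as a thin wrapper around a scheduler factory. This is a minimal illustration, not part of tesseract.js: `createRecyclingScheduler` and `makeScheduler` are hypothetical names, and the wrapper only assumes that the scheduler it receives has `addJob` and `terminate` methods, as tesseract.js schedulers do.

```javascript
// Hypothetical wrapper (not part of tesseract.js): hands each scheduler at
// most `maxJobs` jobs, then starts a fresh one and terminates the old one
// once its in-flight jobs have settled.
function createRecyclingScheduler(makeScheduler, maxJobs = 500) {
  let gen = null; // current generation: { ready, assigned, pending }

  async function addJob(...args) {
    // Start a new scheduler on first use or when the current one is full.
    if (gen === null || gen.assigned >= maxJobs) {
      gen = { ready: makeScheduler(), assigned: 0, pending: 0 };
    }
    const mine = gen;
    mine.assigned += 1;
    mine.pending += 1;
    try {
      const scheduler = await mine.ready;
      return await scheduler.addJob(...args);
    } finally {
      mine.pending -= 1;
      // A full generation is terminated when its last job finishes.
      if (mine.assigned >= maxJobs && mine.pending === 0) {
        if (gen === mine) gen = null;
        await (await mine.ready).terminate();
      }
    }
  }

  // Terminates the active scheduler, if any (for example at shutdown).
  async function terminate() {
    if (gen !== null) {
      const last = gen;
      gen = null;
      await (await last.ready).terminate();
    }
  }

  return { addJob, terminate };
}

// Possible usage, assuming tesseract.js v5 is installed:
// const { createScheduler, createWorker } = require('tesseract.js');
// const ocr = createRecyclingScheduler(async () => {
//   const scheduler = createScheduler();
//   scheduler.addWorker(await createWorker('eng'));
//   return scheduler;
// }, 500);
// const { data: { text } } = await ocr.addJob('recognize', imagePath);
```

Counting assigned jobs per scheduler, rather than using a timer, keeps the reset tied to the workload that actually grows memory and the internal dictionary; the 500-job threshold is simply the example figure from the text.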
{
"name": "tesseract.js",
"version": "5.1.0",
"version": "5.1.1",
"description": "Pure Javascript Multilingual OCR",
@@ -72,3 +72,3 @@ "main": "src/index.js",
"regenerator-runtime": "^0.13.3",
"tesseract.js-core": "^5.1.0",
"tesseract.js-core": "^5.1.1",
"wasm-feature-detect": "^1.2.11",
@@ -75,0 +75,0 @@ "zlibjs": "^0.3.1"
@@ -132,3 +132,3 @@ const resolvePaths = require('./utils/resolvePaths');
gzip: options.gzip,
lstmOnly: [OEM.LSTM_ONLY, OEM.TESSERACT_LSTM_COMBINED].includes(currentOem)
lstmOnly: [OEM.DEFAULT, OEM.LSTM_ONLY].includes(currentOem)
&& !options.legacyLang,
@@ -135,0 +135,0 @@ },
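The effect of the `lstmOnly` change can be checked in isolation. The sketch below is not the library's code: `usesLstmOnlyData` is a hypothetical name, and the numeric `OEM` values are an assumption that follows Tesseract's engine-mode numbering.

```javascript
// Assumed engine-mode values, following Tesseract's OcrEngineMode numbering.
const OEM = {
  TESSERACT_ONLY: 0,
  LSTM_ONLY: 1,
  TESSERACT_LSTM_COMBINED: 2,
  DEFAULT: 3,
};

// Hypothetical standalone version of the condition containing OEM.DEFAULT:
// LSTM-only data applies when the engine mode is DEFAULT or LSTM_ONLY
// and the caller has not asked for legacy language data.
function usesLstmOnlyData(currentOem, legacyLang = false) {
  return [OEM.DEFAULT, OEM.LSTM_ONLY].includes(currentOem) && !legacyLang;
}
```

Comparing the two versions of the line in the hunk, `TESSERACT_LSTM_COMBINED` and `DEFAULT` swap roles: one condition treats the combined mode as LSTM-only and excludes the default mode, while the other does the opposite.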
Sorry, the diff of this file is too big to display
Sorry, the diff of this file is not supported yet
Updated tesseract.js-core@^5.1.1