tesseract.js
Comparing version 4.1.4 to 5.0.0
docs/api.md
# API | ||
- [createWorker()](#create-worker) | ||
- [Worker.recognize](#worker-recognize) | ||
- [Worker.setParameters](#worker-set-parameters) | ||
- [Worker.reinitialize](#worker-reinitialize) | ||
- [Worker.detect](#worker-detect) | ||
- [Worker.terminate](#worker-terminate) | ||
- [Worker.writeText](#worker-writeText) | ||
@@ -8,8 +13,2 @@ - [Worker.readText](#worker-readText) | ||
- [Worker.FS](#worker-FS) | ||
- [Worker.loadLanguage](#worker-load-language) | ||
- [Worker.initialize](#worker-initialize) | ||
- [Worker.setParameters](#worker-set-parameters) | ||
- [Worker.recognize](#worker-recognize) | ||
- [Worker.detect](#worker-detect) | ||
- [Worker.terminate](#worker-terminate) | ||
- [createScheduler()](#create-scheduler) | ||
@@ -31,6 +30,9 @@ - [Scheduler.addWorker](#scheduler-add-worker) | ||
createWorker is a factory function that creates a tesseract worker, a worker is basically a Web Worker in browser and Child Process in Node. | ||
`createWorker` is a function that creates a Tesseract.js worker. A Tesseract.js worker is an object that creates and manages an instance of Tesseract running in a web worker (browser) or worker thread (Node.js). Once created, OCR jobs are sent through the worker. | ||
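As a minimal sketch of that flow, the helper below creates one worker, sends several OCR jobs through it, and terminates it afterwards. `recognizeAll` and its parameters are hypothetical names; `createWorker` is passed in as an argument only so the snippet stays self-contained, and in an application it would be the function exported by Tesseract.js.

```javascript
// Hypothetical helper: `createWorker` is injected so the sketch runs on its own.
// In an application it would come from `require('tesseract.js')`.
async function recognizeAll(createWorker, images) {
  // 'eng' is the `langs` argument; `oem` and `options` are left at their defaults.
  const worker = await createWorker('eng');
  const texts = [];
  try {
    // Every OCR job goes through the same worker, one after the other.
    for (const image of images) {
      const { data: { text } } = await worker.recognize(image);
      texts.push(text);
    }
  } finally {
    // Release the worker even if a job throws.
    await worker.terminate();
  }
  return texts;
}
```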
**Arguments:** | ||
- `langs` a string indicating which language traineddata to download; multiple languages are concatenated with **+**, e.g. **eng+chi\_tra** | ||
- `oem` an enum indicating which OCR Engine Mode to use | ||
- `options` an object of customized options | ||
@@ -48,2 +50,4 @@ - `corePath` path to a directory containing **both** `tesseract-core.wasm.js` and `tesseract-core-simd.wasm.js` from [Tesseract.js-core](https://www.npmjs.com/package/tesseract.js-core) package | ||
- none: not to read cache and not to write back | ||
- `legacyCore` set to `true` to ensure any code downloaded supports the Legacy model (in addition to LSTM model) | ||
- `legacyLang` set to `true` to ensure any language data downloaded supports the Legacy model (in addition to LSTM model) | ||
- `workerBlobURL` a boolean to define whether to use Blob URL for worker script, default: true | ||
@@ -65,54 +69,68 @@ - `gzip` a boolean to define whether the traineddata from the remote is gzipped, default: true | ||
## Worker | ||
<a name="worker-recognize"></a> | ||
### Worker.recognize(image, options, jobId): Promise | ||
A Worker helps you do OCR-related tasks. It takes a few steps to set up a Worker before it is fully functional. The full flow is: | ||
Worker.recognize() provides the core function of Tesseract.js: it executes OCR. | ||
- FS functions // optional | ||
- loadLanguage | ||
- initialize | ||
- setParameters // optional | ||
- recognize or detect | ||
- terminate | ||
Figures out what words are in `image`, where the words are in `image`, etc. | ||
> Note: `image` should be sufficiently high resolution. | ||
> Often, the same image will get much better results if you upscale it before calling `recognize`. | ||
Each function is async, so using async/await or Promise is required. When it is resolved, you get an object: | ||
```json | ||
{ | ||
"jobId": "Job-1-123", | ||
"data": { ... } | ||
} | ||
``` | ||
`jobId` is generated by Tesseract.js, but you can supply your own when calling any of the functions above. | ||
<a name="worker-writeText"></a> | ||
### Worker.writeText(path, text, jobId): Promise | ||
Worker.writeText() writes a text file to the specified path in MEMFS. It is useful when you want to use features that require tesseract.js to read a file from the file system. | ||
**Arguments:** | ||
- `path` text file path | ||
- `text` content of the text file | ||
- `image` see [Image Format](./image-format.md) for more details. | ||
- `options` an object of customized options | ||
- `rectangle` an object specifying the region of the image to recognize; it should contain top, left, width and height (see example below). | ||
- `output` an object specifying which output formats to return (by default `text`, `blocks`, `hocr`, and `tsv` are returned) | ||
- `jobId` Please see details above | ||
**Output:** | ||
**Examples:** | ||
```javascript | ||
const { createWorker } = Tesseract; | ||
(async () => { | ||
await worker.writeText('tmp.txt', 'Hi\nTesseract.js\n'); | ||
const worker = await createWorker('eng'); | ||
const { data: { text } } = await worker.recognize(image); | ||
console.log(text); | ||
})(); | ||
``` | ||
<a name="worker-readText"></a> | ||
### Worker.readText(path, jobId): Promise | ||
With rectangle | ||
Worker.readText() reads a text file from the specified path in MEMFS. It is useful when you want to check the content. | ||
```javascript | ||
const { createWorker } = Tesseract; | ||
(async () => { | ||
const worker = await createWorker('eng'); | ||
const { data: { text } } = await worker.recognize(image, { | ||
rectangle: { top: 0, left: 0, width: 100, height: 100 }, | ||
}); | ||
console.log(text); | ||
})(); | ||
``` | ||
<a name="worker-set-parameters"></a> | ||
### worker.setParameters(params, jobId): Promise | ||
`worker.setParameters()` sets parameters for the Tesseract API (using SetVariable()). This changes the behavior of Tesseract, and some parameters, such as tessedit\_char\_whitelist, are very useful. | ||
**Arguments:** | ||
- `path` text file path | ||
- `params` an object with key and value of the parameters | ||
- `jobId` Please see details above | ||
Note: `worker.setParameters` cannot be used to change the `oem`, as this value is set at initialization. `oem` is initially set using an argument of `createWorker`. After a worker already exists, changing `oem` requires running `worker.reinitialize`. | ||
**Useful Parameters:** | ||
| name | type | default value | description | | ||
| --------------------------- | ------ | ----------------- | ------------------------------------------------------------------------------------------------------------------------------- | | ||
| tessedit\_pageseg\_mode | enum | PSM.SINGLE\_BLOCK | Check [HERE](https://github.com/tesseract-ocr/tesseract/blob/4.0.0/src/ccstruct/publictypes.h#L163) for definition of each mode | | ||
| tessedit\_char\_whitelist | string | '' | restricts the result to these characters; useful when the content of the image is limited | | ||
| preserve\_interword\_spaces | string | '0' | '0' or '1', keeps the space between words | | ||
| user\_defined\_dpi | string | '' | Define custom dpi, use to fix **Warning: Invalid resolution 0 dpi. Using 70 instead.** | | ||
This list is incomplete. As Tesseract.js passes parameters to the Tesseract engine, all parameters supported by the underlying version of Tesseract should also be supported by Tesseract.js. (Note that parameters marked as “init only” in Tesseract documentation cannot be set by `setParameters` or `recognize`.) | ||
**Examples:** | ||
@@ -122,34 +140,37 @@ | ||
(async () => { | ||
const { data } = await worker.readText('tmp.txt'); | ||
console.log(data); | ||
})(); | ||
await worker.setParameters({ | ||
tessedit_char_whitelist: '0123456789', | ||
}); | ||
})(); | ||
``` | ||
<a name="worker-removeFile"></a> | ||
### Worker.removeFile(path, jobId): Promise | ||
<a name="worker-reinitialize"></a> | ||
### worker.reinitialize(langs, oem, jobId): Promise | ||
Worker.removeFile() removes a file from MEMFS. It is useful when you want to free memory. | ||
`worker.reinitialize()` re-initializes an existing Tesseract.js worker with different `langs` and `oem` arguments. | ||
**Arguments:** | ||
- `path` file path | ||
- `langs` a string indicating which language traineddata to download; multiple languages are concatenated with **+**, e.g. **eng+chi\_tra** | ||
- `oem` an enum indicating which OCR Engine Mode to use | ||
- `jobId` Please see details above | ||
Note: to switch from Tesseract LSTM (`oem` value `1`) to Tesseract Legacy (`oem` value `0`) using `worker.reinitialize()`, the worker must already contain the code required to run the Tesseract Legacy model. Setting `legacyCore: true` and `legacyLang: true` in `createWorker` options ensures this is the case. | ||
**Examples:** | ||
```javascript | ||
(async () => { | ||
await worker.removeFile('tmp.txt'); | ||
})(); | ||
await worker.reinitialize('eng', 1); | ||
``` | ||
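To illustrate the note on switching models, the sketch below creates a worker that is able to run the Legacy model and then re-initializes it with `oem` value `0`. The function names are hypothetical, and `createWorker` is passed in as a parameter only to keep the snippet self-contained; in an application it would be the function exported by Tesseract.js.

```javascript
// Hypothetical helpers; `createWorker` would normally come from Tesseract.js.
async function createLegacyCapableWorker(createWorker) {
  // Start with LSTM (oem 1), but make sure the downloaded code and
  // language data also support the Legacy model.
  return createWorker('eng', 1, { legacyCore: true, legacyLang: true });
}

async function switchToLegacy(worker) {
  // oem 0 selects Tesseract Legacy. `setParameters` cannot change `oem`,
  // so the worker is re-initialized instead.
  await worker.reinitialize('eng', 0);
}
```

Without `legacyCore` and `legacyLang` set at creation time, the second call would have no Legacy code or data to switch to.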
<a name="worker-FS"></a> | ||
### Worker.FS(method, args, jobId): Promise | ||
<a name="worker-detect"></a> | ||
### Worker.detect(image, jobId): Promise | ||
Worker.FS() is a generic FS function that lets you call file system methods directly; see [HERE](https://emscripten.org/docs/api_reference/Filesystem-API.html) for all functions. | ||
Worker.detect() performs OSD (Orientation and Script Detection) on the image instead of OCR. | ||
Note: Running `worker.detect` requires a worker with code and language data that supports Tesseract Legacy (this is not enabled by default). If you want to run `worker.detect`, set `legacyCore` and `legacyLang` to `true` in the `createWorker` options. | ||
**Arguments:** | ||
- `method` method name | ||
- `args` array of arguments to pass | ||
- `image` see [Image Format](./image-format.md) for more details. | ||
- `jobId` Please see details above | ||
@@ -160,36 +181,32 @@ | ||
```javascript | ||
const { createWorker } = Tesseract; | ||
(async () => { | ||
await worker.FS('writeFile', ['tmp.txt', 'Hi\nTesseract.js\n']); | ||
// equal to: | ||
// await worker.writeText('tmp.txt', 'Hi\nTesseract.js\n'); | ||
const worker = await createWorker('eng', 1, {legacyCore: true, legacyLang: true}); | ||
const { data } = await worker.detect(image); | ||
console.log(data); | ||
})(); | ||
``` | ||
<a name="worker-load-language"></a> | ||
### Worker.loadLanguage(langs, jobId): Promise | ||
<a name="worker-terminate"></a> | ||
### Worker.terminate(jobId): Promise | ||
Worker.loadLanguage() loads traineddata from the cache or downloads it from a remote source, and puts the traineddata into the WebAssembly file system. | ||
Worker.terminate() terminates the worker and cleans up | ||
**Arguments:** | ||
- `langs` a string indicating which language traineddata to download; multiple languages are concatenated with **+**, e.g. **eng+chi\_tra** | ||
- `jobId` Please see details above | ||
**Examples:** | ||
```javascript | ||
(async () => { | ||
await worker.loadLanguage('eng+chi_tra'); | ||
await worker.terminate(); | ||
})(); | ||
``` | ||
<a name="worker-initialize"></a> | ||
### Worker.initialize(langs, oem, jobId): Promise | ||
Worker.initialize() initializes the Tesseract API and makes sure it is ready for OCR tasks. | ||
<a name="worker-writeText"></a> | ||
### Worker.writeText(path, text, jobId): Promise | ||
Worker.writeText() writes a text file to the specified path in MEMFS. It is useful when you want to use features that require tesseract.js to read a file from the file system. | ||
**Arguments:** | ||
- `langs` a string indicating the languages loaded by the Tesseract API; it can be a subset of the language traineddata you loaded with Worker.loadLanguage. | ||
- `oem` an enum indicating which OCR Engine Mode to use | ||
- `path` text file path | ||
- `text` content of the text file | ||
- `jobId` Please see details above | ||
@@ -201,30 +218,16 @@ | ||
(async () => { | ||
/** You can load more languages in advance, but use only part of them in Worker.initialize() */ | ||
await worker.loadLanguage('eng+chi_tra'); | ||
await worker.initialize('eng'); | ||
await worker.writeText('tmp.txt', 'Hi\nTesseract.js\n'); | ||
})(); | ||
``` | ||
<a name="worker-set-parameters"></a> | ||
### Worker.setParameters(params, jobId): Promise | ||
Worker.setParameters() sets parameters for the Tesseract API (using SetVariable()). This changes the behavior of Tesseract, and some parameters, such as tessedit\_char\_whitelist, are very useful. | ||
<a name="worker-readText"></a> | ||
### Worker.readText(path, jobId): Promise | ||
Worker.readText() reads a text file from the specified path in MEMFS. It is useful when you want to check the content. | ||
**Arguments:** | ||
- `params` an object with key and value of the parameters | ||
- `path` text file path | ||
- `jobId` Please see details above | ||
**Useful Parameters:** | ||
| name | type | default value | description | | ||
| --------------------------- | ------ | ----------------- | ------------------------------------------------------------------------------------------------------------------------------- | | ||
| tessedit\_ocr\_engine\_mode | enum | OEM.DEFAULT | Check [HERE](https://github.com/tesseract-ocr/tesseract/blob/4.0.0/src/ccstruct/publictypes.h#L268) for definition of each mode | | ||
| tessedit\_pageseg\_mode | enum | PSM.SINGLE\_BLOCK | Check [HERE](https://github.com/tesseract-ocr/tesseract/blob/4.0.0/src/ccstruct/publictypes.h#L163) for definition of each mode | | ||
| tessedit\_char\_whitelist | string | '' | restricts the result to these characters; useful when the content of the image is limited | | ||
| preserve\_interword\_spaces | string | '0' | '0' or '1', keeps the space between words | | ||
| user\_defined\_dpi | string | '' | Define custom dpi, use to fix **Warning: Invalid resolution 0 dpi. Using 70 instead.** | | ||
This list is incomplete. As Tesseract.js passes parameters to the Tesseract engine, all parameters supported by the underlying version of Tesseract should also be supported by Tesseract.js. (Note that parameters marked as “init only” in Tesseract documentation cannot be set by `setParameters` or `recognize`.) | ||
**Examples:** | ||
@@ -234,63 +237,34 @@ | ||
(async () => { | ||
await worker.setParameters({ | ||
tessedit_char_whitelist: '0123456789', | ||
}); | ||
})(); | ||
const { data } = await worker.readText('tmp.txt'); | ||
console.log(data); | ||
})(); | ||
``` | ||
<a name="worker-recognize"></a> | ||
### Worker.recognize(image, options, jobId): Promise | ||
<a name="worker-removeFile"></a> | ||
### Worker.removeFile(path, jobId): Promise | ||
Worker.recognize() provides the core function of Tesseract.js: it executes OCR. | ||
Worker.removeFile() removes a file from MEMFS. It is useful when you want to free memory. | ||
Figures out what words are in `image`, where the words are in `image`, etc. | ||
> Note: `image` should be sufficiently high resolution. | ||
> Often, the same image will get much better results if you upscale it before calling `recognize`. | ||
**Arguments:** | ||
- `image` see [Image Format](./image-format.md) for more details. | ||
- `options` an object of customized options | ||
- `rectangle` an object specifying the region of the image to recognize; it should contain top, left, width and height (see example below). | ||
- `output` an object specifying which output formats to return (by default `text`, `blocks`, `hocr`, and `tsv` are returned) | ||
- `path` file path | ||
- `jobId` Please see details above | ||
**Output:** | ||
**Examples:** | ||
```javascript | ||
const { createWorker } = Tesseract; | ||
(async () => { | ||
const worker = await createWorker(); | ||
await worker.loadLanguage('eng'); | ||
await worker.initialize('eng'); | ||
const { data: { text } } = await worker.recognize(image); | ||
console.log(text); | ||
await worker.removeFile('tmp.txt'); | ||
})(); | ||
``` | ||
With rectangle | ||
<a name="worker-FS"></a> | ||
### Worker.FS(method, args, jobId): Promise | ||
```javascript | ||
const { createWorker } = Tesseract; | ||
(async () => { | ||
const worker = await createWorker(); | ||
await worker.loadLanguage('eng'); | ||
await worker.initialize('eng'); | ||
const { data: { text } } = await worker.recognize(image, { | ||
rectangle: { top: 0, left: 0, width: 100, height: 100 }, | ||
}); | ||
console.log(text); | ||
})(); | ||
``` | ||
Worker.FS() is a generic FS function that lets you call file system methods directly; see [HERE](https://emscripten.org/docs/api_reference/Filesystem-API.html) for all functions. | ||
<a name="worker-detect"></a> | ||
### Worker.detect(image, jobId): Promise | ||
Worker.detect() performs OSD (Orientation and Script Detection) on the image instead of OCR. | ||
**Arguments:** | ||
- `image` see [Image Format](./image-format.md) for more details. | ||
- `method` method name | ||
- `args` array of arguments to pass | ||
- `jobId` Please see details above | ||
@@ -301,23 +275,9 @@ | ||
```javascript | ||
const { createWorker } = Tesseract; | ||
(async () => { | ||
const worker = await createWorker(); | ||
await worker.loadLanguage('eng'); | ||
await worker.initialize('eng'); | ||
const { data } = await worker.detect(image); | ||
console.log(data); | ||
await worker.FS('writeFile', ['tmp.txt', 'Hi\nTesseract.js\n']); | ||
// equal to: | ||
// await worker.writeText('tmp.txt', 'Hi\nTesseract.js\n'); | ||
})(); | ||
``` | ||
<a name="worker-terminate"></a> | ||
### Worker.terminate(jobId): Promise | ||
Worker.terminate() terminates the worker and cleans up | ||
```javascript | ||
(async () => { | ||
await worker.terminate(); | ||
})(); | ||
``` | ||
<a name="create-scheduler"></a> | ||
@@ -416,4 +376,6 @@ ## createScheduler(): Scheduler | ||
recognize() is a function for quickly running a recognize() task. It is not recommended for use in real applications, but it is useful when you want to save some time. | ||
This function is deprecated and should be replaced with `worker.recognize` (see above). | ||
`recognize` works the same as `worker.recognize`, except that a new worker is created, loaded, and destroyed every time the function is called. | ||
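Put differently, the top-level function behaves roughly like the sketch below. This illustrates the behavior described and is not the library's actual source; `recognizeOnce` is a hypothetical name, `createWorker` is injected to keep the snippet self-contained, and the `oem` value of `1` (LSTM) is an assumption.

```javascript
// Hypothetical equivalent of the one-shot call; `createWorker` would
// normally come from Tesseract.js.
async function recognizeOnce(createWorker, image, langs, options) {
  // A brand-new worker is created and loaded for this single call...
  const worker = await createWorker(langs, 1, options);
  try {
    return await worker.recognize(image);
  } finally {
    // ...and destroyed again as soon as the job finishes.
    await worker.terminate();
  }
}
```

Calling this for every image repeats the full setup cost each time, which is why holding on to a single worker is preferable.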
See [Tesseract.js](../src/Tesseract.js) | ||
@@ -424,2 +386,4 @@ | ||
This function is deprecated and should be replaced with `worker.detect` (see above). | ||
Same background as recognize(), but it runs detect instead. | ||
@@ -426,0 +390,0 @@ |
@@ -10,7 +10,5 @@ # Tesseract.js Examples | ||
const worker = await createWorker(); | ||
const worker = await createWorker('eng'); | ||
(async () => { | ||
await worker.loadLanguage('eng'); | ||
await worker.initialize('eng'); | ||
const { data: { text } } = await worker.recognize('https://tesseract.projectnaptha.com/img/eng_bw.png'); | ||
@@ -27,3 +25,3 @@ console.log(text); | ||
const worker = await createWorker({ | ||
const worker = await createWorker('eng', 1, { | ||
logger: m => console.log(m), // Add logger here | ||
@@ -33,4 +31,2 @@ }); | ||
(async () => { | ||
await worker.loadLanguage('eng'); | ||
await worker.initialize('eng'); | ||
const { data: { text } } = await worker.recognize('https://tesseract.projectnaptha.com/img/eng_bw.png'); | ||
@@ -47,7 +43,5 @@ console.log(text); | ||
const worker = await createWorker(); | ||
const worker = await createWorker('eng+chi_tra'); | ||
(async () => { | ||
await worker.loadLanguage('eng+chi_tra'); | ||
await worker.initialize('eng+chi_tra'); | ||
const { data: { text } } = await worker.recognize('https://tesseract.projectnaptha.com/img/eng_bw.png'); | ||
@@ -63,7 +57,5 @@ console.log(text); | ||
const worker = await createWorker(); | ||
const worker = await createWorker('eng'); | ||
(async () => { | ||
await worker.loadLanguage('eng'); | ||
await worker.initialize('eng'); | ||
await worker.setParameters({ | ||
@@ -85,7 +77,5 @@ tessedit_char_whitelist: '0123456789', | ||
const worker = await createWorker(); | ||
const worker = await createWorker('eng'); | ||
(async () => { | ||
await worker.loadLanguage('eng'); | ||
await worker.initialize('eng'); | ||
await worker.setParameters({ | ||
@@ -114,8 +104,6 @@ tessedit_pageseg_mode: PSM.SINGLE_BLOCK, | ||
const worker = await createWorker(); | ||
const worker = await createWorker('eng'); | ||
const rectangle = { left: 0, top: 0, width: 500, height: 250 }; | ||
(async () => { | ||
await worker.loadLanguage('eng'); | ||
await worker.initialize('eng'); | ||
const { data: { text } } = await worker.recognize('https://tesseract.projectnaptha.com/img/eng_bw.png', { rectangle }); | ||
@@ -132,3 +120,3 @@ console.log(text); | ||
const worker = await createWorker(); | ||
const worker = await createWorker('eng'); | ||
const rectangles = [ | ||
@@ -150,4 +138,2 @@ { | ||
(async () => { | ||
await worker.loadLanguage('eng'); | ||
await worker.initialize('eng'); | ||
const values = []; | ||
@@ -169,4 +155,4 @@ for (let i = 0; i < rectangles.length; i++) { | ||
const scheduler = createScheduler(); | ||
const worker1 = await createWorker(); | ||
const worker2 = await createWorker(); | ||
const worker1 = await createWorker('eng'); | ||
const worker2 = await createWorker('eng'); | ||
const rectangles = [ | ||
@@ -188,6 +174,2 @@ { | ||
(async () => { | ||
await worker1.loadLanguage('eng'); | ||
await worker2.loadLanguage('eng'); | ||
await worker1.initialize('eng'); | ||
await worker2.initialize('eng'); | ||
scheduler.addWorker(worker1); | ||
@@ -209,10 +191,6 @@ scheduler.addWorker(worker2); | ||
const scheduler = createScheduler(); | ||
const worker1 = await createWorker(); | ||
const worker2 = await createWorker(); | ||
const worker1 = await createWorker('eng'); | ||
const worker2 = await createWorker('eng'); | ||
(async () => { | ||
await worker1.loadLanguage('eng'); | ||
await worker2.loadLanguage('eng'); | ||
await worker1.initialize('eng'); | ||
await worker2.initialize('eng'); | ||
scheduler.addWorker(worker1); | ||
@@ -219,0 +197,0 @@ scheduler.addWorker(worker2); |
@@ -22,4 +22,2 @@ FAQ | ||
The language model is downloaded by `worker.loadLanguage()` and you need to pass the langs to `worker.initialize()`. | ||
While downloading a language model, Tesseract.js first checks whether \*.traineddata already exists (browser: [IndexedDB](https://developer.mozilla.org/en-US/docs/Web/API/IndexedDB_API); Node.js: fs, in the folder where you execute the command). If the \*.traineddata doesn't exist, it fetches \*.traineddata.gz from [tessdata](https://github.com/naptha/tessdata), ungzips it, and stores it in IndexedDB or fs. You can delete it manually and it will be downloaded again. | ||
@@ -26,0 +24,0 @@ |
@@ -12,16 +12,6 @@ ## Local Installation | ||
```javascript | ||
Tesseract.recognize(image, langs, { | ||
workerPath: 'https://cdn.jsdelivr.net/npm/tesseract.js@v4.0.3/dist/worker.min.js', | ||
langPath: 'https://tessdata.projectnaptha.com/4.0.0', | ||
corePath: 'https://cdn.jsdelivr.net/npm/tesseract.js-core@v4.0.3', | ||
}) | ||
``` | ||
Or | ||
```javascript | ||
const worker = await createWorker({ | ||
workerPath: 'https://cdn.jsdelivr.net/npm/tesseract.js@v4.0.3/dist/worker.min.js', | ||
workerPath: 'https://cdn.jsdelivr.net/npm/tesseract.js@v5.0.0/dist/worker.min.js', | ||
langPath: 'https://tessdata.projectnaptha.com/4.0.0', | ||
corePath: 'https://cdn.jsdelivr.net/npm/tesseract.js-core@v4.0.3', | ||
corePath: 'https://cdn.jsdelivr.net/npm/tesseract.js-core@v5.0.0', | ||
}); | ||
@@ -34,9 +24,16 @@ ``` | ||
### langPath | ||
A string specifying the location of the tesseract language files, with default value 'https://tessdata.projectnaptha.com/4.0.0'. Language file URLs are calculated according to the formula `langPath + langCode + '.traineddata.gz'`. | ||
A string specifying the location of the tesseract language files. Language file URLs are calculated according to the formula `langPath + langCode + '.traineddata.gz'`. If `langPath` is not specified by the user, then the correct language data will be automatically downloaded from the jsDelivr CDN. | ||
### corePath | ||
A string specifying the location of the [tesseract.js-core](https://github.com/naptha/tesseract.js-core) files, with default value 'https://cdn.jsdelivr.net/npm/tesseract.js-core@v4.0.3'. | ||
A string specifying the location of the [tesseract.js-core](https://github.com/naptha/tesseract.js-core) files, with default value 'https://cdn.jsdelivr.net/npm/tesseract.js-core@v5.0.0'. | ||
`corePath` should be set to a directory containing both `tesseract-core-simd.wasm.js` and `tesseract-core.wasm.js`. Tesseract.js will load either `tesseract-core-simd.wasm.js` or `tesseract-core.wasm.js` from the directory depending on whether the users' device supports SIMD (see [https://webassembly.org/roadmap/](https://webassembly.org/roadmap/)). | ||
If you set the `corePath` argument, be sure to set it to a directory that contains **all 4** of these files: | ||
To avoid breaking old code, when `corePath` is set to a specific `.js` file (e.g. `https://cdn.jsdelivr.net/npm/tesseract.js-core@v4.0.3/tesseract-core.wasm.js`), it will load that file regardless of whether the users' device supports SIMD or not. This behavior only exists to preserve backwards compatibility—setting `corePath` to a specific `.js` file is strongly discouraged. Doing so will either result in much slower performance (if `tesseract-core.wasm.js` is specified) or failure to run on certain devices (if `tesseract-core-simd.wasm.js` is specified). | ||
1. `tesseract-core.wasm.js` | ||
2. `tesseract-core-simd.wasm.js` | ||
3. `tesseract-core-lstm.wasm.js` | ||
4. `tesseract-core-simd-lstm.wasm.js` | ||
Tesseract.js will pick the correct file based on your users' device and the `createWorker` options. | ||
To avoid breaking old code, when `corePath` is set to a specific `.js` file (e.g. `https://cdn.jsdelivr.net/npm/tesseract.js-core@v5.0.0/tesseract-core.wasm.js`), it will load that file regardless of whether the users' device supports SIMD or not. This behavior only exists to preserve backwards compatibility—setting `corePath` to a specific `.js` file is strongly discouraged. Doing so will either result in much slower performance (if `tesseract-core.wasm.js` is specified) or failure to run on certain devices (if `tesseract-core-simd.wasm.js` is specified). |
@@ -5,11 +5,11 @@ # Overview | ||
# Reducing Setup Time | ||
Within certain applications, the majority of runtime may be attributable to setup steps (`createWorker`, `worker.initialize`, and `worker.loadLanguage`) rather than recognition (`worker.recognize`). Implementing the strategies in this section should reduce the time spent on these steps. | ||
Within certain applications, the majority of runtime may be attributable to setup steps (`createWorker`) rather than recognition (`worker.recognize`). Implementing the strategies in this section should reduce the time spent on these steps. | ||
Notably, the time spent on setup for first-time users may not be apparent to developers, as Tesseract.js caches language data after it is downloaded for the first time. To experience Tesseract.js as a first-time user, set `cacheMethod: 'none'` in the [createWorker options](./api.md#createworkeroptions-worker). Be sure to remove this setting before publishing your app. | ||
### Reuse Workers | ||
When recognizing multiple images, some users will create/load/destroy a new worker for each image. This is never the correct option. If the images are being recognized one after the other, all of the extra `createWorker`/`worker.initialize`/`worker.loadLanguage` steps are wasted runtime, as `worker.recognize` could be run with the same `worker`. Workers do not break after one use. | ||
When recognizing multiple images, some users will create/load/destroy a new worker for each image. This is never the correct option. If the images are being recognized one after the other, all of the extra steps required to create/load/destroy a new worker are wasted runtime, as `worker.recognize` could be run with the same `worker`. Workers do not break after one use. | ||
Alternatively, if images are being recognized in parallel, then creating a new worker for each recognition job is likely to cause crashes due to resource limitations. As each Tesseract.js worker uses a large amount of memory, code should never be able to create an arbitrary number of workers. Instead, a scheduler should be used to create a fixed pool of workers (say, 4 workers), with jobs then assigned through the scheduler. | ||
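A fixed-size pool could look roughly like the sketch below. `recognizeWithPool` and `poolSize` are hypothetical names, and `createWorker` and `createScheduler` are injected only so the snippet stays self-contained; in an application they would be imported from Tesseract.js.

```javascript
// Hypothetical helper; `createWorker` and `createScheduler` would normally be
// imported from Tesseract.js and are injected here to keep the sketch standalone.
async function recognizeWithPool({ createWorker, createScheduler }, images, poolSize = 4) {
  const scheduler = createScheduler();
  // Create a fixed number of workers, regardless of how many images there are.
  for (let i = 0; i < poolSize; i++) {
    scheduler.addWorker(await createWorker('eng'));
  }
  try {
    // Jobs are queued on the scheduler, which hands them to idle workers.
    const jobs = images.map((image) => scheduler.addJob('recognize', image));
    const results = await Promise.all(jobs);
    return results.map(({ data: { text } }) => text);
  } finally {
    // Terminates every worker in the pool.
    await scheduler.terminate();
  }
}
```

The number of workers stays at `poolSize` no matter how many images are submitted, so memory use is bounded.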
### Set Up Workers Ahead of Time | ||
Rather than waiting until the last minute to load code and data, you can set up a worker ahead of time. Doing so greatly reduces runtime the first time a user runs recognition. This requires managing workers rather than using `Tesseract.recognize`, which is explained [here](./intro.md). An example where a worker is prepared ahead of time can be found [here](../examples/browser/basic-efficient.html). | ||
Rather than waiting until the last minute to load code and data, you can set up a worker ahead of time. Doing so greatly reduces runtime the first time a user runs recognition. An example where a worker is prepared ahead of time can be found [here](../examples/browser/basic-efficient.html). | ||
@@ -19,11 +19,6 @@ The appropriate time to load Tesseract.js workers and data is application-specific. For example, if you have an web app where only 5% of users need OCR, it likely does not make sense to download ~15MB in code and data upon a page load. You could consider instead loading Tesseract.js when a user indicates they want to perform OCR, but before they select a specific image. | ||
### Do Not Disable Language Data Caching | ||
Language data is, by far, the largest download required to run Tesseract.js. The default `eng.traineddata` file is 10.4MB compressed. The default `chi_sim.traineddata` file is 19.2MB compressed. | ||
Language data is one of the largest downloads required to run Tesseract.js. While most language data files (including the default English file) are ~2MB, in a worst-case scenario they can be much larger. For example, setting the recognition model (`oem`) to Tesseract Legacy and language to Chinese (simplified) results in a ~20MB file being downloaded. | ||
To avoid downloading language data multiple times, Tesseract.js caches `.traineddata` files. In past versions of Tesseract.js, this caching behavior contained bugs, so some users disabled it (setting `cacheMethod: 'none'` or `cacheMethod: 'refresh'`). As these bugs were fixed in [v4.0.6](https://github.com/naptha/tesseract.js/releases/tag/v4.0.6), it is now recommended that users use the default `cacheMethod` value (i.e. just ignore the `cacheMethod` argument). | ||
### Consider Using Smaller Language Data | ||
The default language data used by Tesseract.js includes data for both Tesseract engines (LSTM [the default model] and Legacy), and is optimized for quality rather than speed. Both the inclusion of multiple models and the focus on quality increase the size of the language data. Setting a non-default `langPath` may result in significantly smaller files being downloaded. | ||
For example, by changing `langPath` from the default (`https://tessdata.projectnaptha.com/4.0.0`) to `https://tessdata.projectnaptha.com/4.0.0_fast` the size of the compressed English language data is reduced from 10.9MB to 2.0MB. Note that this language data (1) only supports the default LSTM model and (2) is optimized for size/speed rather than quality, so users should switch only after testing whether this data works for their application. | ||
# Reducing Recognition Runtime | ||
@@ -34,9 +29,13 @@ | ||
### Consider Using the Legacy Model | ||
In general, the LSTM (default) recognition model provides the best quality. However, the Legacy model generally runs faster, and depending on your application, may provide sufficient recognition quality. If runtime is a significant concern, consider experimenting with the Legacy model (by setting `oem` to `"0"` within `worker.initialize`). As a rule of thumb, the Legacy model is usually viable when the input data is high-quality (high-definition screenshots, document scans, etc.). | ||
### Do Not Set `corePath` to a Single `.js` file | ||
If you set the `corePath` argument, be sure to set it to a directory that contains **all 4** of these files: | ||
1. `tesseract-core.wasm.js` | ||
2. `tesseract-core-simd.wasm.js` | ||
3. `tesseract-core-lstm.wasm.js` | ||
4. `tesseract-core-simd-lstm.wasm.js` | ||
Tesseract.js needs to be able to pick between these files—setting `corePath` to a specific `.js` file will significantly degrade performance or compatibility. | ||
### Consider Using "Fast" Language Data | ||
By default, Tesseract.js uses language data that is optimized for quality rather than speed. You can also experiment with using language data that is optimized for speed by setting `langPath` to `https://tessdata.projectnaptha.com/4.0.0_fast`. | ||
### Do Not Set `corePath` to a Single `.js` file | ||
If you set the `corePath` argument, be sure to set it to a directory that contains both `tesseract-core.wasm.js` and `tesseract-core-simd.wasm.js`. Tesseract.js needs to be able to pick between both files; setting `corePath` to a specific `.js` file will significantly degrade performance or compatibility. See [this comment](https://github.com/naptha/tesseract.js/issues/735#issuecomment-1519157646) for explanation. | ||
By default, Tesseract.js uses language data that is optimized for quality rather than speed. You can also experiment with using language data that is optimized for speed by setting `langPath` to `https://tessdata.projectnaptha.com/4.0.0_fast`. We have not benchmarked the impact this has on performance or accuracy, so feel free to open a Git Issue if you do so and wish to share results. |
@@ -13,4 +13,2 @@ #!/usr/bin/env node | ||
const worker = await createWorker(); | ||
await worker.loadLanguage('eng'); | ||
await worker.initialize('eng'); | ||
const { data: { text, pdf } } = await worker.recognize(image, {pdfTitle: "Example PDF"}, {pdf: true}); | ||
@@ -17,0 +15,0 @@ console.log(text); |
@@ -24,4 +24,2 @@ #!/usr/bin/env node | ||
const worker = await createWorker(); | ||
await worker.loadLanguage('eng'); | ||
await worker.initialize('eng'); | ||
const { data: { imageColor, imageGrey, imageBinary } } = await worker.recognize(image, {rotateAuto: true}, {imageColor: true, imageGrey: true, imageBinary: true}); | ||
@@ -28,0 +26,0 @@ |
@@ -11,7 +11,5 @@ #!/usr/bin/env node | ||
(async () => { | ||
const worker = await createWorker({ | ||
const worker = await createWorker("eng", 1, { | ||
logger: m => console.log(m), | ||
}); | ||
await worker.loadLanguage('eng'); | ||
await worker.initialize('eng'); | ||
const { data: { text } } = await worker.recognize(image); | ||
@@ -18,0 +16,0 @@ console.log(text); |
const { createWorker, createScheduler } = require('../../'); | ||
const path = require('path'); | ||
const [,, imagePath] = process.argv; | ||
// Note: This example recognizes the same image 4 times in parallel | ||
// to show how schedulers can be used to speed up bulk jobs. | ||
// In actual use you would (obviously) not want to run multiple identical jobs. | ||
const image = path.resolve(__dirname, (imagePath || '../../tests/assets/images/cosmic.png')); | ||
const imageArr = [image, image, image, image]; | ||
const scheduler = createScheduler(); | ||
@@ -7,5 +16,3 @@ | ||
const workerGen = async () => { | ||
const worker = await createWorker({cachePath: "."}); | ||
await worker.loadLanguage('eng'); | ||
await worker.initialize('eng'); | ||
const worker = await createWorker("eng", 1, {cachePath: "."}); | ||
scheduler.addWorker(worker); | ||
@@ -18,10 +25,15 @@ } | ||
for (let i=0; i<workerN; i++) { | ||
resArr[i] = await workerGen(); | ||
resArr[i] = workerGen(); | ||
} | ||
await Promise.all(resArr); | ||
/** Add 4 recognition jobs */ | ||
const results = await Promise.all(Array(10).fill(0).map(() => ( | ||
scheduler.addJob('recognize', 'https://tesseract.projectnaptha.com/img/eng_bw.png').then((x) => console.log(x.data.text)) | ||
))) | ||
const resArr2 = Array(imageArr.length); | ||
for (let i = 0; i < imageArr.length; i++) { | ||
resArr2[i] = scheduler.addJob('recognize', image).then((x) => console.log(x.data.text)); | ||
} | ||
await Promise.all(resArr2); | ||
await scheduler.terminate(); // It also terminates all workers. | ||
})(); |
{ | ||
"name": "tesseract.js", | ||
"version": "4.1.4", | ||
"version": "5.0.0", | ||
"description": "Pure Javascript Multilingual OCR", | ||
@@ -15,3 +15,3 @@ "main": "src/index.js", | ||
"prepublishOnly": "npm run build", | ||
"wait": "rimraf dist && wait-on http://localhost:3000/dist/tesseract.dev.js", | ||
"wait": "rimraf dist && wait-on http://localhost:3000/dist/tesseract.min.js", | ||
"test": "npm-run-all -p -r start test:all", | ||
@@ -73,3 +73,3 @@ "test:all": "npm-run-all wait test:browser:* test:node:all", | ||
"regenerator-runtime": "^0.13.3", | ||
"tesseract.js-core": "^4.0.4", | ||
"tesseract.js-core": "^5.0.0", | ||
"wasm-feature-detect": "^1.2.11", | ||
@@ -76,0 +76,0 @@ "zlibjs": "^0.3.1" |
111
README.md
@@ -34,78 +34,28 @@ <p align="center"> | ||
Tesseract.js wraps a [webassembly port](https://github.com/naptha/tesseract.js-core) of the [Tesseract](https://github.com/tesseract-ocr/tesseract) OCR Engine. | ||
It works in the browser using [webpack](https://webpack.js.org/) or plain script tags with a [CDN](#CDN) and on the server with [Node.js](https://nodejs.org/en/). | ||
It works in the browser using [webpack](https://webpack.js.org/), ESM, or plain script tags with a [CDN](#CDN) and on the server with [Node.js](https://nodejs.org/en/). | ||
After you [install it](#installation), using it is as simple as: | ||
```javascript | ||
import Tesseract from 'tesseract.js'; | ||
Tesseract.recognize( | ||
'https://tesseract.projectnaptha.com/img/eng_bw.png', | ||
'eng', | ||
{ logger: m => console.log(m) } | ||
).then(({ data: { text } }) => { | ||
console.log(text); | ||
}) | ||
``` | ||
Or using workers (recommended for production use): | ||
```javascript | ||
import { createWorker } from 'tesseract.js'; | ||
const worker = await createWorker({ | ||
logger: m => console.log(m) | ||
}); | ||
(async () => { | ||
await worker.loadLanguage('eng'); | ||
await worker.initialize('eng'); | ||
const { data: { text } } = await worker.recognize('https://tesseract.projectnaptha.com/img/eng_bw.png'); | ||
console.log(text); | ||
const worker = await createWorker('eng'); | ||
const ret = await worker.recognize('https://tesseract.projectnaptha.com/img/eng_bw.png'); | ||
console.log(ret.data.text); | ||
await worker.terminate(); | ||
})(); | ||
``` | ||
When recognizing multiple images, users should create a worker once, run `worker.recognize` for each image, and then run `worker.terminate()` once at the end (rather than running the above snippet for every image). | ||
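The pattern described above (one worker, many images, one `terminate`) might look like the sketch below. `recognizeAll` is a hypothetical helper, and `createWorker` is passed in by the caller so the snippet stays independent of how Tesseract.js is loaded.

```javascript
// Hypothetical helper: recognizes every image with a single worker.
const recognizeAll = async (createWorker, images, langs = 'eng') => {
  const worker = await createWorker(langs); // created once
  try {
    const texts = [];
    for (const image of images) {
      const { data: { text } } = await worker.recognize(image);
      texts.push(text);
    }
    return texts;
  } finally {
    await worker.terminate(); // terminated once, even on error
  }
};

// Usage (assumes tesseract.js is installed):
//   const { createWorker } = require('tesseract.js');
//   const texts = await recognizeAll(createWorker, ['a.png', 'b.png']);
```

Images are processed sequentially here because a single worker handles one job at a time; a scheduler with several workers is the tool for parallel jobs.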
For a basic overview of the functions, including the pros/cons of different approaches, see the [intro](./docs/intro.md). [Check out the docs](#documentation) for a full explanation of the API. | ||
## Major changes in v4 | ||
Version 4 includes many new features and bug fixes--see [this issue](https://github.com/naptha/tesseract.js/issues/662) for a full list. Several highlights are below. | ||
- Added rotation preprocessing options (including auto-rotate) for significantly better accuracy | ||
- Processed images (rotated, grayscale, binary) can now be retrieved | ||
- Improved support for parallel processing (schedulers) | ||
- Breaking changes: | ||
- `createWorker` is now async | ||
- `getPDF` function replaced by `pdf` recognize option | ||
## Major changes in v3 | ||
- Significantly faster performance | ||
- Runtime reduction of 84% for Browser and 96% for Node.js when recognizing the [example images](./examples/data) | ||
- Upgrade to Tesseract v5.1.0 (using emscripten 3.1.18) | ||
- Added SIMD-enabled build for supported devices | ||
- Added support: | ||
- Node.js version 18 | ||
- Removed support: | ||
- ASM.js version, any other old versions of Tesseract.js-core (<3.0.0) | ||
- Node.js versions 10 and 12 | ||
## Major changes in v2 | ||
- Upgrade to tesseract v4.1.1 (using emscripten 1.39.10 upstream) | ||
- Support multiple languages at the same time, eg: eng+chi\_tra for English and Traditional Chinese | ||
- Supported image formats: png, jpg, bmp, pbm | ||
- Support WebAssembly (fallback to ASM.js when browser doesn't support) | ||
- Support Typescript | ||
Read a story about v2: <a href="https://jeromewu.github.io/why-i-refactor-tesseract.js-v2/">Why I refactor tesseract.js v2?</a><br> | ||
Check the <a href="https://github.com/naptha/tesseract.js/tree/support/1.x">support/1.x</a> branch for version 1 | ||
## Installation | ||
Tesseract.js works with a `<script>` tag via local copy or CDN, with webpack via `npm` and on Node.js with `npm/yarn`. | ||
### CDN | ||
```html | ||
<!-- v4 --> | ||
<script src='https://cdn.jsdelivr.net/npm/tesseract.js@4/dist/tesseract.min.js'></script> | ||
<!-- v5 --> | ||
<script src='https://cdn.jsdelivr.net/npm/tesseract.js@5/dist/tesseract.min.js'></script> | ||
``` | ||
After including the script the `Tesseract` variable will be globally available. | ||
After including the script the `Tesseract` variable will be globally available and a worker can be created using `Tesseract.createWorker`. | ||
Alternatively, an ESM build (used with `import` syntax) can be found at `https://cdn.jsdelivr.net/npm/tesseract.js@5/dist/tesseract.esm.min.js`. | ||
@@ -126,8 +76,7 @@ ### Node.js | ||
## Documentation | ||
* [Intro](./docs/intro.md) | ||
* [Workers vs. Schedulers](./docs/workers_vs_schedulers.md) | ||
* [Examples](./docs/examples.md) | ||
* [Image Format](./docs/image-format.md) | ||
* [Supported Image Formats](./docs/image-format.md) | ||
* [API](./docs/api.md) | ||
@@ -137,2 +86,38 @@ * [Local Installation](./docs/local-installation.md) | ||
## Major changes in v5 | ||
Version 5 changes are documented in [this issue](https://github.com/naptha/tesseract.js/issues/820). Highlights are below. | ||
- Significantly smaller files by default (54% smaller for English, 73% smaller for Chinese) | ||
- This results in a ~50% reduction in runtime for first-time users (who do not have the files cached yet) | ||
- Significantly lower memory usage | ||
- Compatible with iOS 17 (using default settings) | ||
- Breaking changes: | ||
- `createWorker` arguments changed | ||
- Setting non-default language and OEM now happens in `createWorker` | ||
- E.g. `createWorker("chi_sim", 1)` | ||
- `worker.initialize` and `worker.loadLanguage` functions now do nothing and can be deleted from code | ||
- See [this issue](https://github.com/naptha/tesseract.js/issues/820) for full list | ||
## Major changes in v4 | ||
Version 4 includes many new features and bug fixes--see [this issue](https://github.com/naptha/tesseract.js/issues/662) for a full list. Several highlights are below. | ||
- Added rotation preprocessing options (including auto-rotate) for significantly better accuracy | ||
- Processed images (rotated, grayscale, binary) can now be retrieved | ||
- Improved support for parallel processing (schedulers) | ||
- Breaking changes: | ||
- `createWorker` is now async | ||
- `getPDF` function replaced by `pdf` recognize option | ||
## Major changes in v3 | ||
- Significantly faster performance | ||
- Runtime reduction of 84% for Browser and 96% for Node.js when recognizing the [example images](./examples/data) | ||
- Upgrade to Tesseract v5.1.0 (using emscripten 3.1.18) | ||
- Added SIMD-enabled build for supported devices | ||
- Added support: | ||
- Node.js version 18 | ||
- Removed support: | ||
- ASM.js version, any other old versions of Tesseract.js-core (<3.0.0) | ||
- Node.js versions 10 and 12 | ||
## Use tesseract.js the way you like! | ||
@@ -173,3 +158,3 @@ | ||
The development server will be available at http://localhost:3000/examples/browser/demo.html in your favorite browser. | ||
It will automatically rebuild `tesseract.dev.js` and `worker.dev.js` when you change files in the **src** folder. | ||
It will automatically rebuild `tesseract.min.js` and `worker.min.js` when you change files in the **src** folder. | ||
@@ -176,0 +161,0 @@ ### Online Setup with a single Click |
@@ -6,3 +6,3 @@ const webpack = require('webpack'); | ||
const cors = require('cors'); | ||
const webpackConfig = require('./webpack.config.dev'); | ||
const webpackConfig = require('./webpack.config.prod'); | ||
@@ -9,0 +9,0 @@ const compiler = webpack(webpackConfig); |
module.exports = { | ||
/* | ||
* default path for downloading *.traineddata | ||
*/ | ||
langPath: 'https://tessdata.projectnaptha.com/4.0.0', | ||
/* | ||
* Use BlobURL for worker script by default | ||
@@ -8,0 +4,0 @@ * TODO: remove this option |
@@ -6,3 +6,3 @@ const resolvePaths = require('./utils/resolvePaths'); | ||
const getId = require('./utils/getId'); | ||
const { defaultOEM } = require('./constants/config'); | ||
const OEM = require('./constants/OEM'); | ||
const { | ||
@@ -19,3 +19,3 @@ defaultOptions, | ||
module.exports = async (_options = {}) => { | ||
module.exports = async (langs = 'eng', oem = OEM.LSTM_ONLY, _options = {}, config = {}) => { | ||
const id = getId('Worker', workerCounter); | ||
@@ -33,2 +33,9 @@ const { | ||
// Current langs, oem, and config file. | ||
// Used if the user ever re-initializes the worker using `worker.reinitialize`. | ||
const currentLangs = typeof langs === 'string' ? langs.split('+') : langs; | ||
let currentOem = oem; | ||
let currentConfig = config; | ||
const lstmOnlyCore = [OEM.DEFAULT, OEM.LSTM_ONLY].includes(oem) && !options.legacyCore; | ||
let workerResReject; | ||
@@ -75,3 +82,3 @@ let workerResResolve; | ||
startJob(createJob({ | ||
id: jobId, action: 'load', payload: { options }, | ||
id: jobId, action: 'load', payload: { options: { lstmOnly: lstmOnlyCore, corePath: options.corePath, logging: options.logging } }, | ||
})) | ||
@@ -112,18 +119,58 @@ ); | ||
const loadLanguage = (langs = 'eng', jobId) => ( | ||
startJob(createJob({ | ||
id: jobId, | ||
action: 'loadLanguage', | ||
payload: { langs, options }, | ||
})) | ||
const loadLanguage = () => ( | ||
console.warn('`loadLanguage` is deprecated and should be removed from code (workers now come with language pre-loaded)') | ||
); | ||
const initialize = (langs = 'eng', oem = defaultOEM, config, jobId) => ( | ||
const loadLanguageInternal = (_langs, jobId) => startJob(createJob({ | ||
id: jobId, | ||
action: 'loadLanguage', | ||
payload: { | ||
langs: _langs, | ||
options: { | ||
langPath: options.langPath, | ||
dataPath: options.dataPath, | ||
cachePath: options.cachePath, | ||
cacheMethod: options.cacheMethod, | ||
gzip: options.gzip, | ||
lstmOnly: [OEM.DEFAULT, OEM.LSTM_ONLY].includes(currentOem) | ||
&& !options.legacyLang, | ||
}, | ||
}, | ||
})); | ||
const initialize = () => ( | ||
console.warn('`initialize` is deprecated and should be removed from code (workers now come pre-initialized)') | ||
); | ||
const initializeInternal = (_langs, _oem, _config, jobId) => ( | ||
startJob(createJob({ | ||
id: jobId, | ||
action: 'initialize', | ||
payload: { langs, oem, config }, | ||
payload: { langs: _langs, oem: _oem, config: _config }, | ||
})) | ||
); | ||
const reinitialize = (langs = 'eng', oem, config, jobId) => { // eslint-disable-line | ||
if (lstmOnlyCore && [OEM.TESSERACT_ONLY, OEM.TESSERACT_LSTM_COMBINED].includes(oem)) throw Error('Legacy model requested but code missing.'); | ||
const _oem = oem || currentOem; | ||
currentOem = _oem; | ||
const _config = config || currentConfig; | ||
currentConfig = _config; | ||
// Only load langs that are not already loaded. | ||
// This logic fails if the user downloaded the LSTM-only English data for a language | ||
// and then uses `worker.reinitialize` to switch to the Legacy engine. | ||
// However, the correct data will still be downloaded after initialization fails | ||
// and this can be avoided entirely | ||
const langsArr = typeof langs === 'string' ? langs.split('+') : langs; | ||
const _langs = langsArr.filter((x) => !currentLangs.includes(x)); | ||
currentLangs.push(..._langs); | ||
return loadLanguageInternal(_langs, jobId) | ||
.then(() => initializeInternal(langs, _oem, _config, jobId)); | ||
}; | ||
const setParameters = (params = {}, jobId) => ( | ||
@@ -156,9 +203,11 @@ startJob(createJob({ | ||
const detect = async (image, jobId) => ( | ||
startJob(createJob({ | ||
const detect = async (image, jobId) => { | ||
if (lstmOnlyCore) throw Error('`worker.detect` requires Legacy model, which was not loaded.'); | ||
return startJob(createJob({ | ||
id: jobId, | ||
action: 'detect', | ||
payload: { image: await loadImage(image) }, | ||
})) | ||
); | ||
})); | ||
}; | ||
@@ -216,2 +265,3 @@ const terminate = async () => { | ||
initialize, | ||
reinitialize, | ||
setParameters, | ||
@@ -224,5 +274,9 @@ recognize, | ||
loadInternal().then(() => workerResResolve(resolveObj)).catch(() => {}); | ||
loadInternal() | ||
.then(() => loadLanguageInternal(langs)) | ||
.then(() => initializeInternal(langs, oem, config)) | ||
.then(() => workerResResolve(resolveObj)) | ||
.catch(() => {}); | ||
return workerRes; | ||
}; |
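The intent stated in the `reinitialize` comment (only load languages that are not already loaded) can be sketched in isolation. `toLangArray` and `langsToLoad` are hypothetical helpers, not library functions, and the sketch assumes languages are given as plain string codes.

```javascript
// Split a "+"-joined language string (e.g. "eng+chi_tra") into an array.
const toLangArray = (langs) => (typeof langs === 'string' ? langs.split('+') : langs);

// Return the requested languages that are missing from `loaded`,
// i.e. the ones that still need to be downloaded.
const langsToLoad = (loaded, requested) => (
  toLangArray(requested).filter((lang) => !loaded.includes(lang))
);

// A worker that already holds English only needs Traditional Chinese.
console.log(langsToLoad(['eng'], 'eng+chi_tra'));
```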
declare namespace Tesseract { | ||
function createScheduler(): Scheduler | ||
function createWorker(options?: Partial<WorkerOptions>): Promise<Worker> | ||
function createWorker(langs?: string | Lang[], oem?: OEM, options?: Partial<WorkerOptions>, config?: string | Partial<InitOptions>): Promise<Worker> | ||
function setLogging(logging: boolean): void | ||
@@ -23,4 +23,3 @@ function recognize(image: ImageLike, langs?: string, options?: Partial<WorkerOptions>): Promise<RecognizeResult> | ||
FS(method: string, args: any[], jobId?: string): Promise<ConfigResult> | ||
loadLanguage(langs?: string | Lang[], jobId?: string): Promise<ConfigResult> | ||
initialize(langs?: string | Lang[], oem?: OEM, config?: string | Partial<InitOptions>, jobId?: string): Promise<ConfigResult> | ||
reinitialize(langs?: string | Lang[], oem?: OEM, config?: string | Partial<InitOptions>, jobId?: string): Promise<ConfigResult> | ||
setParameters(params: Partial<WorkerParams>, jobId?: string): Promise<ConfigResult> | ||
@@ -65,2 +64,4 @@ getImage(type: imageType): string | ||
gzip: boolean | ||
legacyLang: boolean | ||
legacyCore: boolean | ||
logger: (arg: LoggerMessage) => void, | ||
@@ -67,0 +68,0 @@ errorHandler: (arg: any) => void |
const createWorker = require('./createWorker'); | ||
const recognize = async (image, langs, options) => { | ||
const worker = await createWorker(options); | ||
await worker.loadLanguage(langs); | ||
await worker.initialize(langs); | ||
const worker = await createWorker(langs, 1, options); | ||
return worker.recognize(image) | ||
@@ -14,5 +12,3 @@ .finally(async () => { | ||
const detect = async (image, options) => { | ||
const worker = await createWorker(options); | ||
await worker.loadLanguage('osd'); | ||
await worker.initialize('osd'); | ||
const worker = await createWorker('osd', 0, options); | ||
return worker.detect(image) | ||
@@ -19,0 +15,0 @@ .finally(async () => { |
const { simd } = require('wasm-feature-detect'); | ||
const { dependencies } = require('../../../package.json'); | ||
module.exports = async (corePath, res) => { | ||
module.exports = async (lstmOnly, corePath, res) => { | ||
if (typeof global.TesseractCore === 'undefined') { | ||
res.progress({ status: 'loading tesseract core', progress: 0 }); | ||
const statusText = 'loading tesseract core'; | ||
res.progress({ status: statusText, progress: 0 }); | ||
// If the user specifies a core path, we use that | ||
@@ -22,3 +24,9 @@ // Otherwise, default to CDN | ||
if (simdSupport) { | ||
corePathImportFile = `${corePathImport.replace(/\/$/, '')}/tesseract-core-simd.wasm.js`; | ||
if (lstmOnly) { | ||
corePathImportFile = `${corePathImport.replace(/\/$/, '')}/tesseract-core-simd-lstm.wasm.js`; | ||
} else { | ||
corePathImportFile = `${corePathImport.replace(/\/$/, '')}/tesseract-core-simd.wasm.js`; | ||
} | ||
} else if (lstmOnly) { | ||
corePathImportFile = `${corePathImport.replace(/\/$/, '')}/tesseract-core-lstm.wasm.js`; | ||
} else { | ||
@@ -40,5 +48,5 @@ corePathImportFile = `${corePathImport.replace(/\/$/, '')}/tesseract-core.wasm.js`; | ||
} | ||
res.progress({ status: 'loading tesseract core', progress: 1 }); | ||
res.progress({ status: statusText, progress: 1 }); | ||
} | ||
return global.TesseractCore; | ||
}; |
@@ -31,11 +31,15 @@ /** | ||
let params = defaultParams; | ||
let cachePathWorker; | ||
let cacheMethodWorker; | ||
let loadLanguageLangsWorker; | ||
let loadLanguageOptionsWorker; | ||
let dataFromCache = false; | ||
const load = async ({ workerId, jobId, payload: { options: { corePath, logging } } }, res) => { | ||
const load = async ({ workerId, jobId, payload: { options: { lstmOnly, corePath, logging } } }, res) => { // eslint-disable-line max-len | ||
setLogging(logging); | ||
const statusText = 'initializing tesseract'; | ||
if (!TessModule) { | ||
const Core = await adapter.getCore(corePath, res); | ||
const Core = await adapter.getCore(lstmOnly, corePath, res); | ||
res.progress({ workerId, status: 'initializing tesseract', progress: 0 }); | ||
res.progress({ workerId, status: statusText, progress: 0 }); | ||
@@ -53,3 +57,3 @@ Core({ | ||
TessModule = tessModule; | ||
res.progress({ workerId, status: 'initialized tesseract', progress: 1 }); | ||
res.progress({ workerId, status: statusText, progress: 1 }); | ||
res.resolve({ loaded: true }); | ||
@@ -77,2 +81,3 @@ }); | ||
gzip = true, | ||
lstmOnly, | ||
}, | ||
@@ -82,6 +87,18 @@ }, | ||
res) => { | ||
// Remember cache options for later, as cache may be deleted if `initialize` fails | ||
cachePathWorker = cachePath; | ||
cacheMethodWorker = cacheMethod; | ||
// Remember options for later, as cache may be deleted if `initialize` fails | ||
loadLanguageLangsWorker = langs; | ||
loadLanguageOptionsWorker = { | ||
langPath, | ||
dataPath, | ||
cachePath, | ||
cacheMethod, | ||
gzip, | ||
lstmOnly, | ||
}; | ||
const statusText = 'loading language traineddata'; | ||
const langsArr = typeof langs === 'string' ? langs.split('+') : langs; | ||
let progress = 0; | ||
const loadAndGunzipFile = async (_lang) => { | ||
@@ -101,4 +118,4 @@ const lang = typeof _lang === 'string' ? _lang : _lang.code; | ||
log(`[${workerId}]: Load ${lang}.traineddata from cache`); | ||
res.progress({ workerId, status: 'loading language traineddata (from cache)', progress: 0.5 }); | ||
data = _data; | ||
dataFromCache = true; | ||
} else { | ||
@@ -114,10 +131,15 @@ throw Error('Not found in cache'); | ||
// If `langPath` is not explicitly set by the user, the jsdelivr CDN is used. | ||
// Data supporting the Legacy model is only included if `lstmOnly` is not true. | ||
// This saves a significant amount of data for the majority of users that use LSTM only. | ||
const langPathDownload = langPath || (lstmOnly ? `https://cdn.jsdelivr.net/npm/@tesseract.js-data/${lang}/4.0.0_best_int` : `https://cdn.jsdelivr.net/npm/@tesseract.js-data/${lang}/4.0.0`); | ||
// For Node.js, langPath may be a URL or local file path | ||
// The is-url package is used to tell the difference | ||
// For the browser version, langPath is assumed to be a URL | ||
if (env !== 'node' || isURL(langPath) || langPath.startsWith('moz-extension://') || langPath.startsWith('chrome-extension://') || langPath.startsWith('file://')) { /** When langPath is an URL */ | ||
path = langPath.replace(/\/$/, ''); | ||
if (env !== 'node' || isURL(langPathDownload) || langPathDownload.startsWith('moz-extension://') || langPathDownload.startsWith('chrome-extension://') || langPathDownload.startsWith('file://')) { /** When langPathDownload is an URL */ | ||
path = langPathDownload.replace(/\/$/, ''); | ||
} | ||
// langPath is a URL, fetch from server | ||
// langPathDownload is a URL, fetch from server | ||
if (path !== null) { | ||
@@ -131,6 +153,6 @@ const fetchUrl = `${path}/${lang}.traineddata${gzip ? '.gz' : ''}`; | ||
// langPath is a local file, read .traineddata from local filesystem | ||
// langPathDownload is a local file, read .traineddata from local filesystem | ||
// (adapter.readCache is a generic file read function in Node.js version) | ||
} else { | ||
data = await adapter.readCache(`${langPath}/${lang}.traineddata${gzip ? '.gz' : ''}`); | ||
data = await adapter.readCache(`${langPathDownload}/${lang}.traineddata${gzip ? '.gz' : ''}`); | ||
} | ||
@@ -142,2 +164,5 @@ } else { | ||
progress += 0.5 / langsArr.length; | ||
if (res) res.progress({ workerId, status: statusText, progress }); | ||
// Check for gzip magic numbers (1F and 8B in hex) | ||
@@ -155,3 +180,3 @@ const isGzip = (data[0] === 31 && data[1] === 139) || (data[1] === 31 && data[0] === 139); | ||
} catch (err) { | ||
res.reject(err.toString()); | ||
if (res) res.reject(err.toString()); | ||
} | ||
@@ -170,12 +195,15 @@ } | ||
} | ||
return Promise.resolve(); | ||
progress += 0.5 / langsArr.length; | ||
// Make sure last progress message is 1 (not 0.9999) | ||
if (Math.round(progress * 100) === 100) progress = 1; | ||
if (res) res.progress({ workerId, status: statusText, progress }); | ||
}; | ||
res.progress({ workerId, status: 'loading language traineddata', progress: 0 }); | ||
if (res) res.progress({ workerId, status: statusText, progress: 0 }); | ||
try { | ||
await Promise.all((typeof langs === 'string' ? langs.split('+') : langs).map(loadAndGunzipFile)); | ||
res.progress({ workerId, status: 'loaded language traineddata', progress: 1 }); | ||
res.resolve(langs); | ||
await Promise.all(langsArr.map(loadAndGunzipFile)); | ||
if (res) res.resolve(langs); | ||
} catch (err) { | ||
res.reject(err.toString()); | ||
if (res) res.reject(err.toString()); | ||
} | ||
@@ -221,5 +249,7 @@ }; | ||
const statusText = 'initializing api'; | ||
try { | ||
res.progress({ | ||
workerId, status: 'initializing api', progress: 0, | ||
workerId, status: statusText, progress: 0, | ||
}); | ||
@@ -244,3 +274,3 @@ if (api !== null) { | ||
api = new TessModule.TessBaseAPI(); | ||
const status = api.Init(null, langs, oem); | ||
let status = api.Init(null, langs, oem); | ||
if (status === -1) { | ||
@@ -250,13 +280,46 @@ // Cache is deleted if initialization fails to avoid keeping bad data in cache | ||
// this should be refined if other reasons for init failing are encountered. | ||
if (['write', 'refresh', undefined].includes(cacheMethodWorker)) { | ||
// The "if" condition skips this section if either (1) cache is disabled [so the issue | ||
// is definitely unrelated to cached data] or (2) cache is set to read-only | ||
// [so we do not have permission to make any changes]. | ||
if (['write', 'refresh', undefined].includes(loadLanguageOptionsWorker.cacheMethod)) { | ||
const langsArr = langs.split('+'); | ||
const delCachePromise = langsArr.map((lang) => adapter.deleteCache(`${cachePathWorker || '.'}/${lang}.traineddata`)); | ||
const delCachePromise = langsArr.map((lang) => adapter.deleteCache(`${loadLanguageOptionsWorker.cachePath || '.'}/${lang}.traineddata`)); | ||
await Promise.all(delCachePromise); | ||
// Check for the case when (1) data was loaded from the cache and | ||
// (2) the data does not support the requested OEM. | ||
// In this case, loadLanguage is re-run and initialization is attempted a second time. | ||
// This is because `loadLanguage` has no mechanism for checking whether the cached data | ||
// supports the requested model, so this only becomes apparent when initialization fails. | ||
// Check for this error message: | ||
// eslint-disable-next-line | ||
// "Tesseract (legacy) engine requested, but components are not present in ./eng.traineddata!!"" | ||
// The .wasm build of Tesseract saves this message in a separate file | ||
// (in addition to the normal debug file location). | ||
const debugStr = TessModule.FS.readFile('/debugDev.txt', { encoding: 'utf8', flags: 'a+' }); | ||
if (dataFromCache && /components are not present/.test(debugStr)) { | ||
log('Data from cache missing requested OEM model. Attempting to refresh cache with new language data.'); | ||
// In this case, language data is re-loaded | ||
await loadLanguage({ workerId, payload: { langs: loadLanguageLangsWorker, options: loadLanguageOptionsWorker } }); // eslint-disable-line max-len | ||
status = api.Init(null, langs, oem); | ||
if (status === -1) { | ||
log('Language data refresh failed.'); | ||
const delCachePromise2 = langsArr.map((lang) => adapter.deleteCache(`${loadLanguageOptionsWorker.cachePath || '.'}/${lang}.traineddata`)); | ||
await Promise.all(delCachePromise2); | ||
} else { | ||
log('Language data refresh successful.'); | ||
} | ||
} | ||
} | ||
} | ||
if (status === -1) { | ||
res.reject('initialization failed'); | ||
} | ||
params = defaultParams; | ||
await setParameters({ payload: { params } }); | ||
res.progress({ | ||
workerId, status: 'initialized api', progress: 1, | ||
workerId, status: statusText, progress: 1, | ||
}); | ||
@@ -263,0 +326,0 @@ res.resolve(); |
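How the download URL for a language file is derived in the URL case can be summarized in one function. This is a condensed sketch of the `loadLanguage` changes above; `resolveTraineddataUrl` is a hypothetical name, and the local-file branch used in Node.js is left out.

```javascript
// Build the URL for `<lang>.traineddata`, following the defaults in the diff:
// a user-supplied `langPath` wins; otherwise the jsdelivr CDN is used, with the
// smaller LSTM-only data (`4.0.0_best_int`) chosen when `lstmOnly` is true.
const resolveTraineddataUrl = (lang, { langPath, lstmOnly = false, gzip = true } = {}) => {
  const base = langPath || (lstmOnly
    ? `https://cdn.jsdelivr.net/npm/@tesseract.js-data/${lang}/4.0.0_best_int`
    : `https://cdn.jsdelivr.net/npm/@tesseract.js-data/${lang}/4.0.0`);
  // Strip a trailing slash, then append the file name (gzipped by default).
  return `${base.replace(/\/$/, '')}/${lang}.traineddata${gzip ? '.gz' : ''}`;
};

console.log(resolveTraineddataUrl('eng', { lstmOnly: true }));
```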
const { simd } = require('wasm-feature-detect'); | ||
const OEM = require('../../constants/OEM'); | ||
@@ -8,14 +9,22 @@ let TesseractCore = null; | ||
*/ | ||
module.exports = async (_, res) => { | ||
module.exports = async (oem, _, res) => { | ||
if (TesseractCore === null) { | ||
const statusText = 'loading tesseract core'; | ||
const simdSupport = await simd(); | ||
res.progress({ status: 'loading tesseract core', progress: 0 }); | ||
res.progress({ status: statusText, progress: 0 }); | ||
if (simdSupport) { | ||
TesseractCore = require('tesseract.js-core/tesseract-core-simd'); | ||
if ([OEM.DEFAULT, OEM.LSTM_ONLY].includes(oem)) { | ||
TesseractCore = require('tesseract.js-core/tesseract-core-simd-lstm'); | ||
} else { | ||
TesseractCore = require('tesseract.js-core/tesseract-core-simd'); | ||
} | ||
} else if ([OEM.DEFAULT, OEM.LSTM_ONLY].includes(oem)) { | ||
TesseractCore = require('tesseract.js-core/tesseract-core-lstm'); | ||
} else { | ||
TesseractCore = require('tesseract.js-core/tesseract-core'); | ||
} | ||
res.progress({ status: 'loaded tesseract core', progress: 1 }); | ||
res.progress({ status: statusText, progress: 1 }); | ||
} | ||
return TesseractCore; | ||
}; |
@@ -1,2 +0,1 @@ | ||
const resolveURL = (s) => (new URL(s, window.location.href)).href; | ||
const { version } = require('../../../package.json'); | ||
@@ -10,10 +9,3 @@ const defaultOptions = require('../../constants/defaultOptions'); | ||
...defaultOptions, | ||
workerPath: (typeof process !== 'undefined' && process.env.TESS_ENV === 'development') | ||
? resolveURL(`/dist/worker.dev.js?nocache=${Math.random().toString(36).slice(3)}`) | ||
: `https://cdn.jsdelivr.net/npm/tesseract.js@v${version}/dist/worker.min.js`, | ||
/* | ||
* If browser doesn't support WebAssembly, | ||
* load ASM version instead | ||
*/ | ||
corePath: null, | ||
workerPath: `https://cdn.jsdelivr.net/npm/tesseract.js@v${version}/dist/worker.min.js`, | ||
}; |
+ Added tesseract.js-core@5.1.1 (transitive)
- Removed tesseract.js-core@4.0.4 (transitive)
Updated tesseract.js-core@^5.0.0