
n8n-nodes-tesseractjs
An n8n module that exposes Tesseract.js, an OCR library that can detect text in images
This is an n8n community node. It lets you use Tesseract.js in your n8n workflows.
Tesseract is an open source OCR (Optical Character Recognition) engine that can recognize text (machine/typed/printed text, not handwritten) in images (e.g. PNG or JPEG) or images embedded in PDF files.
To read text from PDFs directly (not from images), you may want the Extract from PDF operation of the built-in Extract from file node.
n8n is a fair-code licensed workflow automation platform.
Installation
Operations
Compatibility
Usage
Resources
Version history
Follow the installation guide in the n8n community nodes documentation.
You can quickly get started by importing the sample playbook.
Note: both operations will output a new binary field, called ocr, which contains the image that was actually OCR'd.
For input images, this will be the same image. For input PDFs, this will be each image in the PDF.
This operation reads the text of the entire image. It outputs a JSON item containing the entire recognized text, and a "confidence value" indicating how likely the generated text is to match the source image, as a percentage:
{
"text": "...",
"confidence": 94
}
If passed a PDF instead of an image, the node may output several items, one for each image in the PDF. Each item will have the format described above.
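As a rough illustration of working with this output downstream, the sketch below types one output item and keeps only the items above a confidence threshold. The helper name `keepConfident`, the threshold of 80 and the sample items are assumptions made for this example; they are not part of the node. In an n8n Code node, which runs JavaScript, the type annotations would need to be dropped.

```typescript
// Shape of one Extract text output item, as described above.
interface ExtractTextItem {
  text: string;
  confidence: number; // percentage
}

// Hypothetical helper: keep only items at or above a minimum confidence.
function keepConfident(
  items: ExtractTextItem[],
  minConfidence: number,
): ExtractTextItem[] {
  return items.filter((item) => item.confidence >= minConfidence);
}

// Made-up sample data: e.g. two images extracted from one PDF.
const pages: ExtractTextItem[] = [
  { text: "Invoice 2024-001", confidence: 94 },
  { text: "l|l1I", confidence: 31 },
];

// The threshold of 80 is arbitrary; tune it for your documents.
const reliable = keepConfident(pages, 80);
```

A threshold like this is a blunt instrument: a low confidence value often signals a poor scan rather than wrong text, so it may be worth routing those items to manual review instead of discarding them.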
This operation also reads text, but returns more information about the bounding box of each detected block, and the detected language if available.
The "granularity" of the detections can be controlled: you can split on paragraphs, lines, words or individual characters.
{
"blocks": [
{
"text": "This",
"confidence": 95.15690612792969,
"bbox": {
"x0": 36,
"y0": 92,
"x1": 96,
"y1": 116
},
"language": "eng"
},
{
"text": "is",
"confidence": 95.15690612792969,
"bbox": {
"x0": 109,
"y0": 92,
"x1": 129,
"y1": 116
},
"language": "eng"
},
...
]
}
Entire paragraphs:
Per-line statistics:
If passed a PDF instead of an image, the node may output several items, one for each image in the PDF. Each item will have the format described above.
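The block format above can be consumed as in the following sketch, which types a block, derives the pixel size of a bounding box, and reassembles the text in a naive reading order. The helper names `boxSize` and `readingOrderText` are assumptions made for this example, and the ordering assumes that words on the same line share the same `y0`, which real scans will not always satisfy.

```typescript
// Shapes taken from the Extract boxes sample output above.
interface BBox {
  x0: number;
  y0: number;
  x1: number;
  y1: number;
}

interface OcrBlock {
  text: string;
  confidence: number;
  bbox: BBox;
  language?: string; // only present if a language was detected
}

// Width and height of a bounding box, in pixels.
function boxSize(bbox: BBox): { width: number; height: number } {
  return { width: bbox.x1 - bbox.x0, height: bbox.y1 - bbox.y0 };
}

// Naive reading order: top to bottom, then left to right.
function readingOrderText(blocks: OcrBlock[]): string {
  return blocks
    .slice() // do not mutate the caller's array
    .sort((a, b) => a.bbox.y0 - b.bbox.y0 || a.bbox.x0 - b.bbox.x0)
    .map((block) => block.text)
    .join(" ");
}

const thisWord: OcrBlock = {
  text: "This",
  confidence: 95.15690612792969,
  bbox: { x0: 36, y0: 92, x1: 96, y1: 116 },
  language: "eng",
};
const isWord: OcrBlock = {
  text: "is",
  confidence: 95.15690612792969,
  bbox: { x0: 109, y0: 92, x1: 129, y1: 116 },
  language: "eng",
};

const size = boxSize(thisWord.bbox); // { width: 60, height: 24 }
const sentence = readingOrderText([isWord, thisWord]); // "This is"
```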
This node has been tested on n8n v1.68.0, but should also work on older versions. If you encounter an issue with an older version, please raise an issue.
All Operations of this node have a field Input Image Field Name, where the name of a Binary item should be provided:
The Binary file with that name will be read and processed. It should be an image or a PDF document. If a PDF, all images inside the PDF will be extracted and processed separately.
It's possible to limit the OCR to a certain segment of an image, by toggling the Detect on Entire Image? switch to Off. When doing so, it's necessary to provide the top and left coordinates of the desired bounding box, and the width and height of the box. For example:
This could delineate a region like this:
When performing OCR on that region, it'd only return the text that was contained in that box, even truncating words if they lie halfway across the bounding box's borders:
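To make the geometry concrete, here is a minimal sketch that converts a top/left/width/height region into corner coordinates and classifies a word's bounding box as inside, partially inside, or outside that region. The names `regionToBBox` and `placement` and the region values are assumptions made for this example, and it assumes the region uses the same pixel coordinate system as the `bbox` values in the output.

```typescript
// Region as entered in the node: top/left corner plus size (assumed pixels).
interface Region {
  top: number;
  left: number;
  width: number;
  height: number;
}

interface BBox {
  x0: number;
  y0: number;
  x1: number;
  y1: number;
}

// Convert top/left/width/height into corner coordinates.
function regionToBBox(r: Region): BBox {
  return { x0: r.left, y0: r.top, x1: r.left + r.width, y1: r.top + r.height };
}

type Placement = "inside" | "partial" | "outside";

// Classify a word box relative to the region.
function placement(word: BBox, region: BBox): Placement {
  const overlaps =
    word.x0 < region.x1 && word.x1 > region.x0 &&
    word.y0 < region.y1 && word.y1 > region.y0;
  if (!overlaps) return "outside";
  const contained =
    word.x0 >= region.x0 && word.x1 <= region.x1 &&
    word.y0 >= region.y0 && word.y1 <= region.y1;
  return contained ? "inside" : "partial";
}

// Hypothetical region: spans x 30..120 and y 80..130.
const region = regionToBBox({ top: 80, left: 30, width: 90, height: 50 });

const first = placement({ x0: 36, y0: 92, x1: 96, y1: 116 }, region); // "inside"
const second = placement({ x0: 109, y0: 92, x1: 129, y1: 116 }, region); // "partial"
```

Words classified as "partial" are the ones likely to come back truncated, so widening the region slightly is often easier than repairing the text afterwards.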
This option can be used if Tesseract can't autodetect the image's resolution, such as a PNG that doesn't carry that information.
If you'd like to abort OCR when a certain image takes too long to process, set the Timeout option (in milliseconds).
Any items that take longer than that time to process will be interrupted and raise an error. Set On Error on the node's Settings if you want execution to continue.
Items that time out will have the error or timeout key set in the JSON output, so they can be extracted later if desired. Items that have a confidence value were successfully processed.
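One way to separate the two groups afterwards is sketched below: items carrying an `error` or `timeout` key go into one list, everything else into another. The helper name `splitByOutcome` and the sample values (including the error text) are placeholders invented for this example; only the key names come from the description above.

```typescript
// JSON payload of one output item; the exact fields vary by operation.
type OcrJson = Record<string, unknown>;

// Hypothetical helper: split items into processed and failed groups.
function splitByOutcome(items: OcrJson[]): { ok: OcrJson[]; failed: OcrJson[] } {
  const ok: OcrJson[] = [];
  const failed: OcrJson[] = [];
  for (const item of items) {
    // Per the description above, failed items carry an error or timeout key.
    if ("error" in item || "timeout" in item) {
      failed.push(item);
    } else {
      ok.push(item);
    }
  }
  return { ok, failed };
}

// Made-up sample data; real error messages will differ.
const results: OcrJson[] = [
  { text: "Hello", confidence: 91 },
  { error: "placeholder error message" },
  { timeout: true },
];

const outcome = splitByOutcome(results);
```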
If you know that the source image can only have certain characters (e.g. license plates that can only have uppercase letters, numbers or a dash) or can't have certain characters (e.g. if Tesseract is inserting question marks when there aren't any), you can explicitly specify which characters will be allowed or which characters will be ignored.
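The sketch below illustrates what an allowed or ignored character list means for the license plate case, and can double as a sanity check on the returned text. It does not show how the node applies these lists internally; `PLATE_CHARS`, `usesOnlyAllowed` and `containsAny` are names invented for this example.

```typescript
// Allowed characters for the license plate example:
// uppercase letters, digits and a dash.
const PLATE_CHARS = "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789-";

// True if every character of text appears in the allowed list.
function usesOnlyAllowed(text: string, allowed: string): boolean {
  return text.split("").every((ch) => allowed.indexOf(ch) !== -1);
}

// True if text contains at least one character from the ignored list.
function containsAny(text: string, ignored: string): boolean {
  return text.split("").some((ch) => ignored.indexOf(ch) !== -1);
}

const plateOk = usesOnlyAllowed("AB-123", PLATE_CHARS); // true
const plateBad = usesOnlyAllowed("ab-123", PLATE_CHARS); // false: lowercase letters
const hasQuestionMark = containsAny("What?", "?"); // true
```

Note that whitespace is not in `PLATE_CHARS`, so text containing spaces or line breaks would fail the check unless it is trimmed or split first.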
For example, when only the digits 0-9 are allowed, and splitting on words:
Or, when disallowing all uppercase characters:
Initial version, contains the Extract text and Extract boxes operations.
npm i
To add this node to a local n8n install:
npm link
npm run dev # or npm run build the first time, or when adding assets such as the node's logo
# in ~/.n8n/custom
npm link n8n-nodes-tesseractjs