
ocr-document-classification
Document classification using tesseract.js and string-similarity-js.
The OCR Document Classification package provides a utility to classify documents based on their content. It uses OCR (Optical Character Recognition) to extract text from images and then determines the document type by matching extracted words with predefined target words using string similarity.
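The fuzzy word matching described above can be sketched as follows. This is an illustration only, not the package's actual implementation: `similarity` is a hand-written Dice coefficient over character bigrams standing in for a string-similarity library, and `fuzzyIncludes` and the 0.8 threshold are hypothetical names and values chosen for the example.

```typescript
// Split a string into lowercase character bigrams, e.g. "abc" -> ["ab", "bc"].
function bigrams(s: string): string[] {
  const t = s.toLowerCase();
  const out: string[] = [];
  for (let i = 0; i < t.length - 1; i++) {
    out.push(t.slice(i, i + 2));
  }
  return out;
}

// Dice coefficient over bigrams: 1 for identical strings, 0 for no overlap.
// Assumption: a stand-in for whatever similarity measure the package uses.
function similarity(a: string, b: string): number {
  const x = bigrams(a);
  const y = bigrams(b);
  if (x.length === 0 || y.length === 0) {
    return a.toLowerCase() === b.toLowerCase() ? 1 : 0;
  }
  const counts = new Map<string, number>();
  for (const g of x) {
    counts.set(g, (counts.get(g) ?? 0) + 1);
  }
  let shared = 0;
  for (const g of y) {
    const n = counts.get(g) ?? 0;
    if (n > 0) {
      shared++;
      counts.set(g, n - 1);
    }
  }
  return (2 * shared) / (x.length + y.length);
}

// True if any OCR token is at least `threshold` similar to the target word.
// The default threshold is an assumed value, not one taken from the package.
function fuzzyIncludes(tokens: string[], target: string, threshold = 0.8): boolean {
  return tokens.some((token) => similarity(token, target) >= threshold);
}
```

Matching on similarity rather than equality lets a target word still be found when OCR drops or misreads a character, at the cost of occasional false positives if the threshold is set too low.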
To install this package, use npm:
npm install ocr-document-classification
The main function exported by this package is classifyDocument. Below is a detailed guide on how to use it.
import { classifyDocument } from "ocr-document-classification";
import type { documentDictionary } from "ocr-document-classification";
classifyDocument returns a Promise that resolves with an object containing:
classification: The determined document type.
text: The extracted text from the document.
A few default classes are included for classifying the most common documents. As shown below, each key maps to one or more arrays of words. A document matches a class when every word in at least ONE of that class's arrays is found in the text after OCR. You can also add your own class by creating a customDocumentDictionary.
const defaultDocumentDictionary: documentDictionary = {
  MILITÆRBEVIS: [
    ["førstegangstjeneste", "bevis", "avtjent"],
    ["attest", "førstegangstjeneste"],
    ["fullført", "førstegangstjeneste"],
  ],
  POLITIATTEST: [["politiattest", "politidistrikt"], ["police certificate"]],
  KOMPETANSEBEVIS: [["omfatter", "opplæring", "utdanningsprogram"]],
  LEGEERKLÆRING: [["legeerklæring", "fødselsnummer"]],
  BOSTEDSATTEST: [
    ["registrerte", "opplysninger", "folkeregisteret"],
    ["bostedsattest", "bostedsadresse", "registrert"],
    ["registrert", "adressehistorikk", "folkeregisteret"],
  ],
};
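The rule that every word of at least one array must be present can be sketched like this. It is a simplified illustration under stated assumptions, not the package's code: `SketchDictionary` and `classifyText` are hypothetical names, plain substring matching replaces the fuzzy similarity matching, and returning `null` when nothing matches is an assumption about the no-match case.

```typescript
// Hypothetical local type mirroring the dictionary shape shown above:
// each class label maps to alternative groups of required words.
type SketchDictionary = Record<string, string[][]>;

// Return the first label for which at least one group has ALL of its
// words present in the text, or null if no class matches (assumed behavior).
function classifyText(text: string, dictionary: SketchDictionary): string | null {
  const haystack = text.toLowerCase();
  for (const label of Object.keys(dictionary)) {
    const groups = dictionary[label];
    const matched = groups.some((group) =>
      // Exact substring match here; the package itself uses string similarity.
      group.every((word) => haystack.includes(word.toLowerCase()))
    );
    if (matched) {
      return label;
    }
  }
  return null;
}

const sketchDictionary: SketchDictionary = {
  POLITIATTEST: [["politiattest", "politidistrikt"], ["police certificate"]],
  LEGEERKLÆRING: [["legeerklæring", "fødselsnummer"]],
};

const label = classifyText("Politiattest utstedt av Oslo politidistrikt", sketchDictionary);
console.log(label);
```

Listing several arrays per class lets one document type be recognized from different wordings, such as a Norwegian and an English version of the same certificate.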
Here is an example of how the package can be used with a custom document dictionary in React:
import React, { useState, useEffect } from "react";
import { classifyDocument } from "ocr-document-classification";

function UploadClassification() {
  const [documentFile, setDocumentFile] = useState<File | null>(null);
  const [classification, setClassification] = useState("");
  const [outputText, setOutputText] = useState("");
  const [progress, setProgress] = useState(0);

  // Store the selected file (or null if the selection was cleared).
  const handleFileChange = (event: React.ChangeEvent<HTMLInputElement>) => {
    const file = event.target.files && event.target.files[0];
    setDocumentFile(file);
  };

  // Custom class: matches when all three words are found in the document.
  const customDocumentDictionary = {
    Jobbsøknad: [["søknad", "stilling", "ledig"]],
  };

  // Log OCR progress as it is reported.
  useEffect(() => {
    console.log("Progress: ", progress);
  }, [progress]);

  // Run OCR and classification whenever a new file is selected.
  useEffect(() => {
    // Clear the previous result before starting a new run.
    resetOCR();
    if (documentFile) {
      classifyDocument(documentFile, {
        onProgress: setProgress,
        customDocumentDictionary: customDocumentDictionary,
      })
        .then(({ classification, text }) => {
          setClassification(classification);
          setOutputText(text);
        })
        .catch((err) => {
          console.error(err);
          setOutputText("Error during OCR processing");
        });
    }
  }, [documentFile]);

  function resetOCR() {
    setClassification("");
    setOutputText("");
    setProgress(0);
  }

  return (
    <>
      <input
        accept="image/jpeg, image/png"
        type="file"
        onChange={handleFileChange}
      />
      <div>
        <h3>Resultat av OCR</h3>
        {/* Show the text (or the error message) once available, otherwise a loading label. */}
        <p>{classification || outputText ? outputText : "Laster inn ..."}</p>
        <h1>{classification}</h1>
      </div>
    </>
  );
}

export default UploadClassification;
This package relies on the following dependencies:
string-similarity-js: For calculating the similarity between strings.
tesseract.js: For performing OCR on the document image.
pdfjs-dist: For handling PDFs.
This package is currently UNLICENSED.