
ocr-document-classification

Document classification using tesseract.js and string-similarity-js.

1.0.6
npm

Version published: 5 months ago

Weekly downloads: 2; decreased by 71.43%

Maintainers: 0


Created: 5 months ago


OCR Document Classification

Overview

The OCR Document Classification package provides a utility to classify documents based on their content. It uses OCR (Optical Character Recognition) to extract text from images and then determines the document type by matching extracted words with predefined target words using string similarity.
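OCR output often contains misread characters, which is why similarity matching is useful here instead of exact comparison. The sketch below is not the package's code. It assumes a Sørensen-Dice score over character bigrams, which is a common way to compute string similarity. The names diceSimilarity and isFuzzyMatch and the 0.8 threshold are made up for illustration.

```typescript
// Count the character bigrams (pairs of adjacent characters) in a string.
function countBigrams(s: string): Record<string, number> {
  const counts: Record<string, number> = {};
  for (let i = 0; i < s.length - 1; i++) {
    const gram = s.slice(i, i + 2);
    counts[gram] = (counts[gram] || 0) + 1;
  }
  return counts;
}

// Sørensen-Dice similarity over bigrams: 0 means no overlap, 1 means identical.
function diceSimilarity(a: string, b: string): number {
  const x = a.toLowerCase();
  const y = b.toLowerCase();
  if (x === y) return 1;
  if (x.length < 2 || y.length < 2) return 0;

  const gramsX = countBigrams(x);
  const gramsY = countBigrams(y);
  let overlap = 0;
  for (const gram of Object.keys(gramsX)) {
    overlap += Math.min(gramsX[gram], gramsY[gram] || 0);
  }
  // Each string of length n has n - 1 bigrams.
  return (2 * overlap) / (x.length - 1 + (y.length - 1));
}

// Hypothetical threshold; the value the package actually uses is not documented.
const SIMILARITY_THRESHOLD = 0.8;

// True when an OCR word is close enough to a target word to count as found.
function isFuzzyMatch(
  ocrWord: string,
  target: string,
  threshold: number = SIMILARITY_THRESHOLD
): boolean {
  return diceSimilarity(ocrWord, target) >= threshold;
}
```

With this scoring, a misread such as "politiattcst" still counts as a match for "politiattest", while unrelated words such as "bevis" and "attest" share no bigrams and score 0.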

Installation

To install this package, use npm:

npm install ocr-document-classification

Usage

The main function exported by this package is classifyDocument. Below is a detailed guide on how to use it.

Importing the Package

import { classifyDocument } from "ocr-document-classification";
import type { documentDictionary } from "ocr-document-classification";

Function: classifyDocument

Parameters

file: The image file (File object) of the document to be classified.
onProgress (optional): A callback function to receive progress updates. It accepts a number between 0 and 1.
customDocumentDictionary (optional): An object containing custom document types and their associated target words.

Returns

A Promise that resolves with an object containing:

classification: The determined document type.
text: The extracted text from the document.
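The documented parameters and return value can be written down as TypeScript shapes. This is a sketch under assumptions: the type names OnProgress and ClassificationResult and the helpers toPercent, percentReporter and summarize are hypothetical and are not exported by the package. They only illustrate how the 0 to 1 progress value and the resolved object might be consumed.

```typescript
// Hypothetical type names; the package does not document these.
type OnProgress = (progress: number) => void;

interface ClassificationResult {
  classification: string; // the determined document type
  text: string; // the text extracted from the document
}

// Convert a 0..1 progress value to a whole percentage, clamping bad input.
function toPercent(progress: number): number {
  const clamped = Math.min(1, Math.max(0, progress));
  return Math.round(clamped * 100);
}

// Build an onProgress callback that forwards whole percentages to a sink,
// for example a state setter or a progress bar update function.
function percentReporter(sink: (percent: number) => void): OnProgress {
  return (progress) => sink(toPercent(progress));
}

// Produce a one-line summary of a resolved classification result.
function summarize(result: ClassificationResult): string {
  const length = result.text.trim().length;
  return `${result.classification}: ${length} characters extracted`;
}
```

A callback built with percentReporter could be passed as the onProgress argument, and summarize could be applied to the object the Promise resolves with.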

Classes

The package ships with a few default classes for classifying the most common documents. In the dictionary below, each key maps to one or more arrays of target words. A document matches a class when every word of at least ONE of that class's arrays is found in the text after OCR. You can also add your own classes by creating a customDocumentDictionary.

const defaultDocumentDictionary: documentDictionary  = {
    BEVISPAFORSTEGANGSTJENESTE: [
      ['førstegangstjeneste', 'bevis', 'avtjent'],
      ['attest', 'førstegangstjeneste'],
      ['fullført', 'førstegangstjeneste'],
    ],
    POLITIATTEST: [['politiattest', 'politidistrikt'], ['police certificate']],
    KOMPETANSEBEVIS: [['omfatter', 'opplæring', 'utdanningsprogram']],
    LEGEERKLERING: [['legeerklæring', 'fødselsnummer']],
  }
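The matching rule for a dictionary like this one can be sketched as follows. This is an illustration, not the package's implementation: it uses exact, case-insensitive substring checks where the package uses string similarity, and the names classifyText and sampleDictionary, as well as returning null when nothing matches, are assumptions.

```typescript
// Assumed shape: each document type maps to alternative lists of target words.
type DocumentDictionary = Record<string, string[][]>;

// Two entries copied from the default dictionary, used as sample data.
const sampleDictionary: DocumentDictionary = {
  BEVISPAFORSTEGANGSTJENESTE: [
    ["førstegangstjeneste", "bevis", "avtjent"],
    ["attest", "førstegangstjeneste"],
    ["fullført", "førstegangstjeneste"],
  ],
  POLITIATTEST: [["politiattest", "politidistrikt"], ["police certificate"]],
};

// Return the first document type for which every word of at least one
// alternative list occurs in the text, or null when no type matches.
function classifyText(
  text: string,
  dictionary: DocumentDictionary
): string | null {
  const haystack = text.toLowerCase();
  for (const docType of Object.keys(dictionary)) {
    const alternatives = dictionary[docType];
    const matched = alternatives.some((words) =>
      words.every((word) => haystack.indexOf(word.toLowerCase()) !== -1)
    );
    if (matched) return docType;
  }
  return null;
}
```

Note that a single word from a list is not enough: a text containing only "attest" matches nothing, because the second BEVISPAFORSTEGANGSTJENESTE list also requires "førstegangstjeneste".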

Example

Here is an example of how the package can be used with a custom document dictionary in React:

import React, { useEffect, useState } from 'react'
import { classifyDocument } from 'ocr-document-classification'

function UploadClassification() {
  const [documentFile, setDocumentFile] = useState<File | null>(null)
  const [classification, setClassification] = useState('')
  const [outputText, setOutputText] = useState('')
  const [progress, setProgress] = useState(0)

  const handleFileChange = (event: React.ChangeEvent<HTMLInputElement>) => {
    const file = event.target.files && event.target.files[0]
    setDocumentFile(file)
  }

  // Custom class: matches when 'søknad', 'stilling' and 'ledig' are all found
  const customDocumentDictionary = {
    Jobbsøknad: [['søknad', 'stilling', 'ledig']],
  }

  // Log OCR progress (a number between 0 and 1) whenever it changes
  useEffect(() => {
    console.log('Progress: ', progress)
  }, [progress])

  // Clear the previous result, then classify the newly selected file
  useEffect(() => {
    resetOCR()
    if (documentFile) {
      classifyDocument(documentFile, setProgress, customDocumentDictionary)
        .then(({ classification, text }) => {
          setClassification(classification)
          setOutputText(text)
        })
        .catch((err) => {
          console.error(err)
          setOutputText('Error during OCR processing')
        })
    }
  }, [documentFile])

  function resetOCR() {
    setClassification('')
    setOutputText('')
    setProgress(0)
  }

  return (
    <>
      <input
        accept="image/jpeg, image/png"
        type="file"
        onChange={handleFileChange}
      />
      <div>
        <h3>OCR result</h3>
        <p>{classification ? outputText : 'Loading ...'}</p>
        <h1>{classification}</h1>
      </div>
    </>
  )
}

Dependencies

This package relies on the following dependencies:

string-similarity-js: For calculating the similarity between strings.
tesseract.js: For performing OCR on the document image.

LICENSE

This package is currently UNLICENSED.

Package last updated on 28 Jun 2024
