bangla-pdf-ocr

A package to extract Bengali text from PDFs using OCR

  • 0.1.0
  • PyPI

Bangla PDF OCR

Extract Bengali text from PDFs using OCR.

Installation

  1. Install the package:

    pip install bangla-pdf-ocr
    
  2. Install system dependencies:

    bangla-pdf-ocr-setup
    

    This command installs tesseract-ocr, poppler-utils, and tesseract-ocr-ben on Linux, or tesseract, poppler, and tesseract-lang on macOS. Windows users should follow the on-screen instructions.

Usage

Run the command with no arguments to extract text from the default PDF (Freedom Fight.pdf):

bangla-pdf-ocr

Optional Arguments:

  • -o or --output: Output file path (defaults to the input filename with a .txt extension)
  • -l or --language: OCR language code (defaults to 'ben' for Bengali)
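
The default output name described above can be reproduced in a script, which is handy when you need to know in advance where the text will be written. This is a minimal sketch that assumes the tool simply swaps the input file's extension for .txt; `default_output_path` is a hypothetical helper, not part of the package.

```python
from pathlib import Path


def default_output_path(input_pdf: str) -> Path:
    """Return the input path with its extension replaced by .txt.

    Assumption: this mirrors the documented default for -o/--output.
    """
    return Path(input_pdf).with_suffix(".txt")


print(default_output_path("Freedom Fight.pdf"))  # Freedom Fight.txt
```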

Examples:

  1. Using Default PDF and Output:

    bangla-pdf-ocr
    

    This will process Freedom Fight.pdf and save the extracted text to Freedom Fight.txt.

  2. Specifying Output File:

    bangla-pdf-ocr my_document.pdf -o extracted_text.txt
    
  3. Specifying OCR Language:

    bangla-pdf-ocr my_document.pdf -l eng
    

Python Module Usage

You can also use Bangla PDF OCR in your Python scripts:

from bangla_pdf_ocr import ocr

# Extract text from a PDF
extracted_text = ocr.process_pdf('path/to/your/file.pdf', output_file='output.txt', language='ben')

# Print the extracted text
print(extracted_text)
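
To process several files at once, the call above can be wrapped in a loop. The following is a sketch rather than a feature of the package: `ocr_folder` is a hypothetical helper, and it assumes `ocr.process_pdf` accepts the `output_file` and `language` keyword arguments and returns the extracted text, as in the example above.

```python
from pathlib import Path


def ocr_folder(folder, language="ben", process=None):
    """OCR every PDF in `folder`, writing a .txt file next to each one.

    Returns a dict mapping each PDF filename to its extracted text.
    """
    if process is None:
        # Imported lazily so the helper can be defined (and tested with a
        # stand-in function) on machines without the package installed.
        from bangla_pdf_ocr import ocr
        process = ocr.process_pdf

    results = {}
    for pdf in sorted(Path(folder).glob("*.pdf")):
        out = pdf.with_suffix(".txt")
        results[pdf.name] = process(str(pdf), output_file=str(out), language=language)
    return results
```

Calling `ocr_folder("scans")` would then process each PDF directly inside a `scans` directory; subdirectories are not searched.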

Troubleshooting

If you encounter any issues:

  1. Ensure Dependencies Are Installed:

    Make sure Tesseract and Poppler are properly installed and their directories are in your system's PATH.

  2. For Windows Users:

    • Verify you've installed the Bengali language data for Tesseract.
    • Ensure the tessdata directory contains ben.traineddata.

  3. Check Logs:

    Review the console output and logs for any error messages.

  4. Re-run Setup Command:

    If dependencies were not installed correctly, try running:

    bangla-pdf-ocr-setup
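
The PATH and language-data checks above can also be scripted. This is a rough sketch using only the Python standard library; `missing_tools` and `tesseract_has_language` are hypothetical helpers, and checking for `pdftoppm` assumes that is the Poppler executable the package relies on. `tesseract --list-langs` is a standard Tesseract command that lists the installed language data.

```python
import shutil
import subprocess


def missing_tools(tools=("tesseract", "pdftoppm"), which=shutil.which):
    """Return the executables from `tools` that are not found on PATH."""
    return [tool for tool in tools if which(tool) is None]


def tesseract_has_language(lang="ben", run=subprocess.run):
    """Return True if `tesseract --list-langs` reports `lang`.

    Raises FileNotFoundError if tesseract itself is not on PATH.
    """
    result = run(["tesseract", "--list-langs"], capture_output=True, text=True)
    # Older Tesseract releases print the list to stderr, newer ones to stdout.
    lines = (result.stdout + "\n" + result.stderr).splitlines()
    return lang in [line.strip() for line in lines]
```

If `missing_tools()` returns an empty list and `tesseract_has_language("ben")` returns True, the dependencies named in steps 1 and 2 are at least discoverable from Python.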
    
