pdf-image-text9-test

A Python package to generate text from PDFs and images

0.1.9
PyPI

Maintainers: 1

Pdf-Image-Text

A library that extracts data from standalone images and from images embedded in PDF files. Powered by fitz (PyMuPDF) and OpenAI.

GitHub - https://github.com/Bain/aag-pdf-image-text

Description:

This library provides the following two classes:

PdfToText - Fetches the images present in a PDF file, gets their transcriptions from OpenAI, extracts the plain text from each PDF page, and returns an object with the image and text data. Users can process the full PDF file or a specific page.
ImageToText - Produces the transcription of the provided image.

Instructions

1. Installation

pip install pdf-image-text

2. Initialize

pdf_to_text = PdfToText(open_ai_key='<>', model='<>')

image_to_text = ImageToText(open_ai_key='<>', model='<>')

Note: These parameters can be omitted if the values are already set in the environment variables 'OPEN_AI_KEY' and 'MODEL'.
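The fallback described in the note can be pictured with a minimal sketch. The helper `resolve_setting` below is hypothetical and not part of the library; it only illustrates one way an explicit argument could take precedence over the `OPEN_AI_KEY` and `MODEL` environment variables. How the library actually resolves these values is an assumption here.

```python
import os


def resolve_setting(explicit_value, env_name):
    """Return the explicit value if given, else the environment variable.

    Hypothetical helper for illustration only; not part of the library.
    """
    if explicit_value:
        return explicit_value
    value = os.environ.get(env_name)
    if value is None:
        raise ValueError(f"Pass a value explicitly or set {env_name}")
    return value


# Set the variables once (placeholder values), e.g. at process start-up.
os.environ["OPEN_AI_KEY"] = "<your key>"
os.environ["MODEL"] = "<your model>"

# With both variables set, no constructor arguments would be needed.
api_key = resolve_setting(None, "OPEN_AI_KEY")
model = resolve_setting(None, "MODEL")
```

In a real deployment the variables would normally be set outside the program (shell profile, container configuration, secret manager) rather than assigned in code.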

3. Load data

From file -

Pdf2Text: `pdf_to_text.load_data(file_name='<Path to file>')`

Image2Text: `image_to_text.load_data(file_name='<Path to file>')`

From file object -

Pdf2Text: `pdf_to_text.load_data(file_bytes_object='<file content>')`

Image2Text: `image_to_text.load_data(file_bytes_object='<file content>')`
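As a rough sketch of how the file content for `file_bytes_object` might be obtained, the hypothetical helper `read_bytes` below returns raw bytes from either a path or a binary file-like object (for example an uploaded file). It assumes that `file_bytes_object` expects a plain `bytes` value, which the text above does not state explicitly.

```python
import io
import tempfile
from pathlib import Path


def read_bytes(source):
    """Return raw bytes from a path or a binary file-like object.

    Hypothetical helper for illustration only; not part of the library.
    """
    if isinstance(source, (str, Path)):
        return Path(source).read_bytes()
    return source.read()


# Demonstration with a throwaway file standing in for a real PDF or image.
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = Path(tmp_dir) / "sample.bin"
    sample_path.write_bytes(b"%PDF-1.4 placeholder")
    from_path = read_bytes(sample_path)

# The same helper works for an in-memory stream.
from_stream = read_bytes(io.BytesIO(b"%PDF-1.4 placeholder"))

# The bytes could then be handed over as shown above, for example:
# pdf_to_text.load_data(file_bytes_object=from_path)
```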

4. Get output

Pdf2Text:

image_filter = ImageFilter(lower_height=<int>, upper_height=<int>, lower_width=<int>, upper_width=<int>)

output = pdf_to_text.get_pdf_content(image_filter=image_filter, page_index=<optional field: int>, include_formatted_content=<optional field: bool>)

Image2Text:

output = image_to_text.get_image_transcription()

5. Response Object

Pdf2Text:

The output is a list of Page objects. Each Page object has the following attributes -

image_content: A list of transcriptions for the images fetched from the current PDF page.
text_content: The plain text fetched from the current PDF page.
formatted_content [Optional]: A formatted string combining the plain text and the figure data (inside a FIGURE TRANSCRIPTIONS section). This is useful in knowledge bot applications. It is controlled by the include_formatted_content flag, which defaults to false.
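To make the shape of the Pdf2Text response concrete, here is a minimal sketch of walking the list of pages. The `Page` dataclass is a stand-in that only mirrors the attribute names listed above; the library's real class may differ. The `summarize` helper and the placeholder data are hypothetical and take the place of a real `get_pdf_content()` result.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Page:
    """Stand-in for the library's Page object (attribute names from the README)."""
    image_content: List[str] = field(default_factory=list)
    text_content: str = ""
    formatted_content: Optional[str] = None


def summarize(pages):
    """Return one line per page: list position, text length, image count."""
    lines = []
    for index, page in enumerate(pages):
        lines.append(
            f"page {index}: {len(page.text_content)} chars, "
            f"{len(page.image_content)} image transcription(s)"
        )
    return lines


# Placeholder data in place of a real get_pdf_content() result.
output = [
    Page(image_content=["<transcription>"], text_content="<page text>"),
    Page(text_content="<more text>"),
]
summary = summarize(output)
```

Because `formatted_content` is optional, code consuming the response should check it for `None` before use.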

Image2Text:

The response contains a string representing the transcription of the provided image.
