pdflayoutxt

This library extracts text from searchable PDF files while keeping the layout intact.

  • 0.0.10
  • PyPI

Maintainers
1

pdflayoutxt

pdflayoutxt is a Python library for extracting text from searchable (non-scanned) PDFs. It makes sure the extracted text keeps the same layout as the original document.

Installation

Use the package manager pip to install pdflayoutxt.

pip install pdflayoutxt

Usage

# import the library
import pdflayoutxt

# create a pdfextracter object
pdfobj = pdflayoutxt.pdfextracter()

# get_pdf_text returns a list with one entry per page, in page order,
# so len(text) equals the number of pages in the document
pdf_path = "./abc.pdf"
text = pdfobj.get_pdf_text(pdf_path=pdf_path)

# output
print(text)
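
Since the result holds one entry per page, it is often handier to walk it page by page than to print it whole. The sketch below is one possible way to do that and is not part of pdflayoutxt: the helper names page_to_text and pages_with_numbers are hypothetical, the sample list stands in for a real get_pdf_text result, and numbering pages from 1 is only a display choice. The README describes the return value both as a list and as a list of lists, so the helper accepts either shape.

```python
def page_to_text(page):
    """Return one page's entry as a single string.

    Assumption: an entry is either a string or a list of text pieces;
    a list is joined with newlines.
    """
    if isinstance(page, str):
        return page
    return "\n".join(str(piece) for piece in page)


def pages_with_numbers(text, first_page=1):
    """Pair each page's text with a display page number."""
    return [(first_page + i, page_to_text(page)) for i, page in enumerate(text)]


# Stand-in for the list returned by get_pdf_text (made-up sample data).
sample = ["Invoice   No. 42", ["Item      Price", "Widget    9.99"]]

for number, page_text in pages_with_numbers(sample):
    print(f"--- page {number} ---")
    print(page_text)
```

With a real document, pass the text variable from the usage example in place of sample.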
Method

.get_pdf_text(pdf_path, pdf_password="", pages=[], left_most_x=0, left_most_y=0, right_most_x=1, right_most_y=1)

Description

Returns a list of lists of text, one entry for each page in the document.

  • pdf_password: a string. If the PDF is encrypted, pass its password here.
  • pages: a list of pages, or an int for a single page, from which text should be extracted. The default extracts text from all pages.
  • left_most_x: the starting point of text extraction on the x axis (width), as a value in [0, 1]. For example, to extract only the rightmost 25% of the page width, pass 0.75.
  • left_most_y: the starting point of text extraction on the y axis (height), as a value in [0, 1]. For example, to extract only the bottom 25% of the page height, pass 0.75.
  • right_most_x: the end point of text extraction on the x axis (width), as a value in [0, 1].
  • right_most_y: the end point of text extraction on the y axis (height), as a value in [0, 1].

The defaults for left_most_x, left_most_y, right_most_x and right_most_y extract text from the complete page without cropping. Set them only when text is needed from a particular area of the page.
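
The four crop parameters are easy to mix up, so a small wrapper that checks them before calling the library may help. The following is a minimal sketch, not part of pdflayoutxt: crop_fractions and extract_region are hypothetical helper names, and the rule that each start value must be smaller than its end value is an assumption, since the README only states that every value lies in [0, 1]. The only library calls used are pdfextracter() and get_pdf_text with the keyword arguments listed above.

```python
def crop_fractions(left_most_x=0.0, left_most_y=0.0,
                   right_most_x=1.0, right_most_y=1.0):
    """Validate the crop fractions and return them as keyword arguments."""
    bounds = {
        "left_most_x": float(left_most_x),
        "left_most_y": float(left_most_y),
        "right_most_x": float(right_most_x),
        "right_most_y": float(right_most_y),
    }
    for name, value in bounds.items():
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must lie in [0, 1], got {value}")
    # Assumption: an empty or inverted box is treated as a caller mistake.
    if bounds["left_most_x"] >= bounds["right_most_x"]:
        raise ValueError("left_most_x must be smaller than right_most_x")
    if bounds["left_most_y"] >= bounds["right_most_y"]:
        raise ValueError("left_most_y must be smaller than right_most_y")
    return bounds


def extract_region(pdf_path, pages=None, pdf_password="", **crop):
    """Extract text from a cropped area of the selected pages."""
    # Imported here so that crop_fractions can be used without the package.
    import pdflayoutxt

    extractor = pdflayoutxt.pdfextracter()
    return extractor.get_pdf_text(
        pdf_path=pdf_path,
        pdf_password=pdf_password,
        pages=[] if pages is None else pages,  # [] is the documented default
        **crop_fractions(**crop),
    )


# Rightmost 25% of the page width, full height.
print(crop_fractions(left_most_x=0.75))
```

A call such as extract_region("./abc.pdf", left_most_y=0.75) would then request only the bottom quarter of each page, following the description of left_most_y above.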

Contributing

Pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change.

License

MIT
