pdfwordify

Tool for extracting text and tables from PDF files and saving this data in docx format

Version: 0.0.1 (PyPI, installed with pip)
Maintainers: 1

pdfwordify is a tool for extracting text and tables from PDF files and saving them in docx (Word) format. It is designed to automate moving information out of PDF files into a format that is easier to edit and process.

Features

  • Extract text from PDF files.
  • Extract text from scanned pages in a PDF (OCR).
  • Extract tables from PDF files.
  • Save the extracted content to a Word (docx) file.

How to use

  • Install Python 3.10 or newer.

  • Install Google Tesseract OCR.

  • Install the library using pip:

    pip install pdfwordify
    
  • Use the command-line interface to convert from PDF to docx.

    pdfwordify example.pdf
    
  • Or use it with Python.

    from pdfwordify.converter import convert_to_docx
    
    convert_to_docx("example.pdf")
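
Before converting, it can help to confirm the two prerequisites from the steps above: Python 3.10 or newer and an installed Tesseract. The sketch below is not part of pdfwordify. `missing_prerequisites` is a hypothetical helper, and it assumes the Tesseract executable is named `tesseract` and is reachable through PATH, which may not match how pdfwordify itself locates it.

```python
import shutil
import sys


def missing_prerequisites(version=None, which=shutil.which):
    """Return a list of human-readable problems; an empty list means all clear.

    Hypothetical helper, not part of pdfwordify. `version` and `which` are
    injectable so the check can be exercised without touching the real system.
    """
    version = sys.version_info if version is None else version
    problems = []

    # The README asks for Python 3.10 or newer.
    if tuple(version[:2]) < (3, 10):
        problems.append("Python 3.10 or newer is required")

    # Assumption: Tesseract is installed as an executable called "tesseract".
    if which("tesseract") is None:
        problems.append("the tesseract executable was not found on PATH")

    return problems


if __name__ == "__main__":
    for problem in missing_prerequisites():
        print("Missing prerequisite:", problem)
```

An empty list only says the interpreter and the OCR binary look usable; it does not check that pdfwordify is installed or that the required Tesseract language data is present.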
    

Arguments

This section describes the converter's arguments. Each one can be used both on the command line and from Python.

  • pdf_path:

    • Description: The path to the input PDF file to be converted.
    • Required: Yes
    • Example:
      • In terminal: pdfwordify dir/example.pdf.
      • In code: convert_to_docx("dir/example.pdf").
  • output_dir:

    • Description: Where to write the docx file. Can be a folder path, a path with a file name, or a full path that includes the .docx extension.
    • Required: No
    • Default: The directory that contains the input PDF.
    • Example:
      • In terminal: pdfwordify dir/example.pdf /output/path/.
      • In code: convert_to_docx("dir/example.pdf", "/output/path/")
  • method:

    • Description: Method for extracting tables from a file.
    • Required: No
    • Default: lattice
    • Types:
      • lattice for tables that have distinct boundaries.

        [Image: table with clear boundaries]
      • stream for tables that have no visible borders.

        [Image: table with no borders]
      • None if there are no tables in the document.

    • Example:
      • In terminal: pdfwordify --method stream dir/example.pdf.
      • In code: convert_to_docx("example.pdf", method=None).
  • lang:

    • Description: Language for extracting text from images within a document using Google Tesseract OCR.
    • Required: No
    • Default: eng
    • Note: It is possible to combine languages. For example: rus+eng
    • Example:
      • In terminal: pdfwordify --lang rus+eng dir/example.pdf.
      • In code: convert_to_docx("example.pdf", lang="rus+eng").
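
The arguments above can be combined in a single call. The following is a minimal sketch of a wrapper that checks `method` and `lang` before forwarding them; `convert_checked` and its `convert` parameter are hypothetical names introduced here, and the sketch assumes `convert_to_docx` accepts the output path as its second positional argument together with the `method` and `lang` keywords, as the individual examples suggest.

```python
VALID_METHODS = ("lattice", "stream", None)


def convert_checked(pdf_path, output_dir=None, method="lattice", lang="eng",
                    convert=None):
    """Validate the documented arguments, then forward them to the converter.

    Hypothetical wrapper, not part of pdfwordify. `convert` defaults to
    pdfwordify's convert_to_docx and can be replaced for testing.
    """
    if method not in VALID_METHODS:
        raise ValueError(f"method must be 'lattice', 'stream' or None, got {method!r}")

    # Languages are joined with '+', e.g. "rus+eng"; reject empty parts.
    if not lang or any(not part for part in lang.split("+")):
        raise ValueError(f"lang must look like 'eng' or 'rus+eng', got {lang!r}")

    if convert is None:
        # Imported lazily so the checks above work without the package installed.
        from pdfwordify.converter import convert_to_docx as convert

    options = {"method": method, "lang": lang}
    if output_dir is None:
        # Let the converter fall back to its default output location.
        return convert(str(pdf_path), **options)
    return convert(str(pdf_path), str(output_dir), **options)
```

For example, `convert_checked("dir/example.pdf", "/output/path/", method="stream", lang="rus+eng")` would forward all four values, while a misspelled method raises `ValueError` before any conversion starts.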

Settings

To further customize the settings, edit the config.py file.

Keywords

convert
