pdfsp

Extracts data from PDF files and saves it to Excel files.

0.1.14

📄 pdfsp

pdfsp is a Python package that extracts tables from PDF files and saves them to Excel. It also provides a simple Streamlit app for interactive viewing of the extracted data.

🚀 Features

  • Extracts tabular data from PDFs using pdfplumber
  • Converts tables into pandas DataFrames
  • Saves output as .xlsx Excel files using openpyxl
  • Ensures column names are unique to prevent issues
  • Visualizes DataFrames with streamlit
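
The unique-column-name step matters because tables pulled from PDFs often repeat a header label or leave header cells empty, and duplicate labels make column selection in a DataFrame ambiguous. The following is a minimal sketch of one way to deduplicate headers. `make_unique` is a hypothetical helper written for illustration, not pdfsp's actual implementation, and both the `_1`, `_2` suffix scheme and the `column` placeholder are assumptions.

```python
def make_unique(names):
    """Return a copy of `names` in which every label is distinct.

    Hypothetical helper for illustration; not part of the pdfsp API.
    """
    seen = set()
    result = []
    for raw in names:
        # Header cells extracted from PDFs can be None or blank.
        base = str(raw).strip() if raw is not None else ""
        if not base:
            base = "column"  # assumed placeholder for missing labels
        candidate = base
        suffix = 1
        # Append _1, _2, ... until the label has not been used yet.
        while candidate in seen:
            candidate = f"{base}_{suffix}"
            suffix += 1
        seen.add(candidate)
        result.append(candidate)
    return result


print(make_unique(["Date", "Amount", "Amount", None, ""]))
# ['Date', 'Amount', 'Amount_1', 'column', 'column_1']
```

The resulting list could be assigned to `DataFrame.columns` before writing to Excel, so every column keeps a distinct header.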

📦 Installation

Make sure you're using Python 3.10 or newer, then install with:

pip install pdfsp -U

From a Python script

# pdf.py
from pdfsp import extract_tables, Options

# Define extraction options
source_folder = "."
output_folder = "output"
combine_tables = True

options = Options(
    source_folder=source_folder,
    output_folder=output_folder,
    combine=combine_tables
)

# Run the table extraction
extract_tables(options)


From the console / terminal / command line

# Extract all tables from all PDF files in the current folder and save them to the current folder
pdfsp . .

# Extract and COMBINE large tables (spanning multiple pages) into single files, saved to the current folder
pdfsp . . --combine

# Extract and COMBINE tables, skipping the first row of each table (e.g., header rows)
pdfsp . . --combine --skiprows=1

# Extract all tables from PDF files in 'someFolder' and save them to 'SomeOutFolder'
pdfsp someFolder SomeOutFolder

# Extract all tables from 'some.pdf' and save them to the current folder
pdfsp some.pdf .

# Extract all tables from 'some.pdf' and save them to 'toThisFolder'
pdfsp some.pdf toThisFolder
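
To make the effect of --combine and --skiprows more concrete, here is a rough sketch of the idea on plain lists of rows. It assumes that combining means concatenating the per-page table fragments in order, and that --skiprows drops the first N rows of each fragment before concatenation. pdfsp's actual behavior may differ, and `combine_fragments` is a hypothetical name, not a pdfsp function.

```python
def combine_fragments(fragments, skiprows=0):
    """Concatenate table fragments, dropping the first `skiprows` rows of each.

    Hypothetical illustration of the --combine / --skiprows idea;
    this is not pdfsp's implementation.
    """
    combined = []
    for rows in fragments:
        # Slicing past the end of a short fragment just yields an empty list.
        combined.extend(rows[skiprows:])
    return combined


# Two fragments of one logical table, each repeating the header row.
page1 = [["Item", "Qty"], ["apple", "3"], ["pear", "5"]]
page2 = [["Item", "Qty"], ["plum", "2"]]

print(combine_fragments([page1, page2], skiprows=1))
# [['apple', '3'], ['pear', '5'], ['plum', '2']]
```

Under these assumptions, skipping one row per fragment removes the repeated header from every page, so the combined rows contain data only.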



=== 📊 Extraction Summary Report ===
✅ Successful Files: 3
   - pdfs/report1.pdf → 🗂️ 5 tables extracted
   - pdfs/summary2.pdf → 🗂️ 3 tables extracted
   - pdfs/report2.pdf → 🗂️ 7 tables extracted

❌ Failed Files: 1
   - pdfs/corrupted.pdf

⚠️ Some files failed to process. See details above.

