pdfsp
pdfsp is a Python package that extracts tables from PDF files and saves them to Excel. It also provides a simple Streamlit app for interactive viewing of the extracted data.
Features
- Extracts tabular data from PDFs using pdfplumber
- Converts tables into pandas DataFrames
- Saves output as .xlsx Excel files using openpyxl
- Ensures column names are unique, so duplicate headers do not clash
- Visualizes DataFrames with streamlit
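
The unique-column-name step can be pictured with a small sketch. This is not pdfsp's actual implementation: the helper name `make_unique`, the `_1`, `_2` suffix scheme, and the `column_N` fallback for blank headers are all assumptions chosen for illustration. It works on a plain list of header cells, such as the first row of an extracted table, where empty cells may be None.

```python
def make_unique(headers):
    """Return column names with duplicates and blanks disambiguated.

    Hypothetical helper for illustration; pdfsp may use a different scheme.
    """
    seen = set()
    result = []
    for position, header in enumerate(headers):
        # Treat None or whitespace-only cells as unnamed columns.
        base = str(header).strip() if header is not None else ""
        if not base:
            base = f"column_{position}"
        # Append a numeric suffix until the name is unused.
        name = base
        suffix = 1
        while name in seen:
            name = f"{base}_{suffix}"
            suffix += 1
        seen.add(name)
        result.append(name)
    return result


print(make_unique(["Date", "Amount", "Amount", None, "Date"]))
# ['Date', 'Amount', 'Amount_1', 'column_3', 'Date_1']
```

A list produced this way can be passed as `columns=` when building a pandas DataFrame from the remaining rows of the table.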
Installation
Make sure you're using Python 3.10 or newer, then install with:
    pip install pdfsp -U
Python script
    from pdfsp import extract_tables, Options

    source_folder = "."        # where the PDF files are read from
    output_folder = "output"   # where the Excel files are written
    combine_tables = True

    options = Options(
        source_folder=source_folder,
        output_folder=output_folder,
        combine=combine_tables,
    )

    extract_tables(options)
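From the console / terminal / command line
The first argument is the source (a folder or a single PDF file) and the second is the output folder:
    pdfsp . .
    pdfsp . . --combine
    pdfsp . . --combine --skiprows=1
    pdfsp someFolder SomeOutFolder
    pdfsp some.pdf .
    pdfsp some.pdf toThisFolder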
Example summary report:
    === Extraction Summary Report ===
    ✅ Successful Files: 3
    - pdfs/report1.pdf → 5 tables extracted
    - pdfs/summary2.pdf → 3 tables extracted
    - pdfs/report2.pdf → 7 tables extracted
    ❌ Failed Files: 1
    - pdfs/corrupted.pdf
    ⚠️ Some files failed to process. See details above.
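
For batch jobs, the console commands shown earlier can also be launched from Python. The sketch below uses only the arguments that appear in those examples (source, output folder, `--combine`, `--skiprows`). The helper names `build_command` and `run_pdfsp` are made up for illustration, and the sketch assumes the `pdfsp` executable is on the PATH.

```python
import subprocess


def build_command(source, output, combine=False, skiprows=None):
    """Build the argument list for one pdfsp invocation (hypothetical helper)."""
    command = ["pdfsp", str(source), str(output)]
    if combine:
        command.append("--combine")
    if skiprows is not None:
        command.append(f"--skiprows={skiprows}")
    return command


def run_pdfsp(source, output, **kwargs):
    """Run pdfsp and return its exit code. Assumes pdfsp is installed."""
    completed = subprocess.run(build_command(source, output, **kwargs))
    return completed.returncode


print(build_command(".", ".", combine=True, skiprows=1))
# ['pdfsp', '.', '.', '--combine', '--skiprows=1']
```

Passing the command as a list rather than a single string avoids shell quoting problems when folder names contain spaces.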