pdftotext
Simple PDF text extraction
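import pdftotext

# Load your PDF
with open("lorem_ipsum.pdf", "rb") as f:
    pdf = pdftotext.PDF(f)

# If it's password-protected, pass the password as the second argument
with open("secure.pdf", "rb") as f:
    pdf = pdftotext.PDF(f, "secret")

# How many pages?
print(len(pdf))

# Iterate over all the pages
for page in pdf:
    print(page)
# Read some individual pages (indexes are zero-based)
print(pdf[0])
print(pdf[1])
# Read all the text into one string
print("\n\n".join(pdf))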
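Because the PDF object supports len(), iteration, and indexing over page strings, ordinary Python sequence code can be applied to it. The sketch below is an illustration only: find_pages is a hypothetical helper, not part of pdftotext, and a plain list of strings stands in for the PDF object so the example runs without a PDF file.

```python
def find_pages(pages, needle):
    """Return zero-based indexes of pages whose text contains needle.

    `pages` is any iterable of page strings. A pdftotext.PDF object is
    assumed to work here because it yields one string per page.
    The match is case-insensitive.
    """
    needle = needle.lower()
    return [i for i, text in enumerate(pages) if needle in text.lower()]


# Stand-in for a pdftotext.PDF object: one string per page.
pages = [
    "Lorem ipsum dolor sit amet",
    "consectetur adipiscing elit",
    "Sed do LOREM tempor",
]

hits = find_pages(pages, "lorem")
print(hits)  # [0, 2]
```

With a real document, passing the pdf object in place of the list should behave the same way, since the helper only relies on iteration.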
OS Dependencies
These instructions assume you're on a recent OS. Package names may differ for an
older OS.
Debian, Ubuntu, and friends
sudo apt install build-essential libpoppler-cpp-dev pkg-config python3-dev
Fedora, Red Hat, and friends
sudo yum install gcc-c++ pkgconfig poppler-cpp-devel python3-devel
macOS
brew install pkg-config poppler python
Windows
Currently tested only when using conda:
Install
pip install pdftotext
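After installing, one way to confirm that the module is visible to the current interpreter is to look it up without importing it. This is a minimal sketch using only the standard library; is_installed is a hypothetical helper name, and it only checks that the module can be found, not that the poppler libraries load correctly.

```python
import importlib.util


def is_installed(module_name):
    """Return True if a top-level module can be located by the import system."""
    return importlib.util.find_spec(module_name) is not None


if is_installed("pdftotext"):
    print("pdftotext is importable")
else:
    print("pdftotext was not found; check the OS dependencies and rerun pip")
```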