A wrapper around the pdftoppm and pdftocairo command line tools to convert PDF to a PIL Image list.
A Python (3.7+) module that wraps pdftoppm and pdftocairo to convert a PDF to a PIL Image object.
pip install pdf2image
Windows users will have to build or download poppler for Windows. I recommend @oschwartz10612's version, which is the most up-to-date. You will then have to add the bin/ folder to PATH or pass poppler_path = r"C:\path\to\poppler-xx\bin" as an argument to convert_from_path.
Mac users will have to install poppler.
Installing using Brew:
brew install poppler
Most distros ship with pdftoppm and pdftocairo. If they are not installed, refer to your package manager to install poppler-utils.
Using conda:
conda install -c conda-forge poppler
pip install pdf2image
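Whichever install route you use, it can help to confirm that the poppler tools are reachable before converting anything. The following is a minimal sketch using only the standard library. The helper name find_poppler_tools is hypothetical and is not part of pdf2image.

```python
import shutil


def find_poppler_tools(poppler_path=None):
    """Map each poppler tool name to its resolved location, or None if missing.

    find_poppler_tools is an illustrative helper, not a pdf2image function.
    """
    tools = ("pdftoppm", "pdftocairo", "pdfinfo")
    # shutil.which searches the given directory, or PATH when path is None.
    return {tool: shutil.which(tool, path=poppler_path) for tool in tools}


missing = [name for name, location in find_poppler_tools().items() if location is None]
if missing:
    print("Missing poppler tools:", ", ".join(missing))
else:
    print("All poppler tools found on PATH")
```

On Windows, passing the same bin folder you would give to poppler_path lets you check that folder instead of PATH.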
from pdf2image import convert_from_path, convert_from_bytes
from pdf2image.exceptions import (
    PDFInfoNotInstalledError,
    PDFPageCountError,
    PDFSyntaxError
)
Then simply do:
images = convert_from_path('/home/belval/example.pdf')
OR
images = convert_from_bytes(open('/home/belval/example.pdf', 'rb').read())
OR better yet
import tempfile
with tempfile.TemporaryDirectory() as path:
    images_from_path = convert_from_path('/home/belval/example.pdf', output_folder=path)
    # Do something here
images will be a list of PIL Image objects, one for each page of the PDF document.
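Since the result is an ordinary list of PIL Images, a common next step is writing each page to disk. This is a small sketch of that step. save_pages and its stem and fmt parameters are hypothetical names chosen for illustration; only the Image.save() call comes from PIL.

```python
from pathlib import Path


def save_pages(images, out_dir, stem="page", fmt="PNG"):
    """Save each image as <stem>-<NNN>.<ext> in out_dir and return the paths.

    save_pages is a hypothetical helper, not part of pdf2image.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    saved = []
    for number, image in enumerate(images, start=1):
        target = out_dir / f"{stem}-{number:03d}.{fmt.lower()}"
        image.save(target, fmt)  # PIL selects the encoder from the format name
        saved.append(target)
    return saved


# Usage (requires pdf2image and poppler to be installed):
# from pdf2image import convert_from_path
# images = convert_from_path('/home/belval/example.pdf')
# save_pages(images, 'out')
```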
Here are the definitions:
convert_from_path(pdf_path, dpi=200, output_folder=None, first_page=None, last_page=None, fmt='ppm', jpegopt=None, thread_count=1, userpw=None, use_cropbox=False, strict=False, transparent=False, single_file=False, output_file=str(uuid.uuid4()), poppler_path=None, grayscale=False, size=None, paths_only=False, use_pdftocairo=False, timeout=600, hide_attributes=False)
convert_from_bytes(pdf_file, dpi=200, output_folder=None, first_page=None, last_page=None, fmt='ppm', jpegopt=None, thread_count=1, userpw=None, use_cropbox=False, strict=False, transparent=False, single_file=False, output_file=str(uuid.uuid4()), poppler_path=None, grayscale=False, size=None, paths_only=False, use_pdftocairo=False, timeout=600, hide_attributes=False)
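The first_page and last_page parameters make it possible to convert a large PDF a few pages at a time instead of holding every page in memory at once. The sketch below shows one way to do that. page_batches and convert_in_batches are hypothetical helpers, and the code assumes the dictionary returned by pdfinfo_from_path carries an integer "Pages" entry.

```python
def page_batches(total_pages, batch_size):
    """Yield inclusive (first_page, last_page) pairs covering 1..total_pages."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for first in range(1, total_pages + 1, batch_size):
        yield first, min(first + batch_size - 1, total_pages)


def convert_in_batches(pdf_path, batch_size=10, dpi=200):
    """Yield one list of PIL Images per batch of pages."""
    # Imported lazily so page_batches stays usable without pdf2image installed.
    from pdf2image import convert_from_path, pdfinfo_from_path

    # Assumption: the info dict exposes the page count as an int under "Pages".
    total_pages = pdfinfo_from_path(pdf_path)["Pages"]
    for first, last in page_batches(total_pages, batch_size):
        yield convert_from_path(pdf_path, dpi=dpi, first_page=first, last_page=last)


# Usage (requires pdf2image and poppler to be installed):
# for images in convert_in_batches('/home/belval/example.pdf', batch_size=5):
#     ...  # process and release each batch before the next one is rendered
```

Because convert_in_batches is a generator, each batch can be garbage-collected before the next one is rendered.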
hide_attributes parameter (Thank you @StaticRocket)
timeout parameter, which raises PDFPopplerTimeoutError after the given number of seconds.
use_pdftocairo parameter, which forces pdf2image to use pdftocairo. Should improve performance.
Fixed a bug where using pdf2image with multiple threads (but not multiple processes) would cause an exception
jpegopt parameter allows for tuning of the output JPEG when using fmt="jpeg" (-jpegopt in pdftoppm CLI) (Thank you @abieler)
pdfinfo_from_path and pdfinfo_from_bytes, which expose the output of the pdfinfo CLI
paths_only parameter will return image paths instead of Image objects, to prevent OOM when converting a big PDF
size parameter allows you to define the shape of the resulting images (-scale-to in pdftoppm CLI)
size=400 will fit the image to a 400x400 box, preserving aspect ratio
size=(400, None) will make the image 400 pixels wide, preserving aspect ratio
size=(500, 500) will resize the image to 500x500 pixels, not preserving aspect ratio
grayscale parameter allows you to convert images to grayscale (-gray in pdftoppm CLI)
single_file parameter allows you to convert the first PDF page only, without adding digits at the end of the output_file
poppler_path parameter lets you specify the folder that contains the poppler binaries
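The three forms of the size parameter are easy to mix up, so here is a rough model of the output dimensions each one produces. expected_size is a hypothetical helper written for illustration only. It does not call pdf2image, and poppler's own rounding may differ by a pixel.

```python
def expected_size(page_size, size):
    """Approximate the output (width, height) for a page rendered at page_size.

    Illustrative model only; pdf2image and poppler do the real scaling.
    """
    width, height = page_size
    if isinstance(size, int):
        # size=400: fit inside a 400x400 box, preserving aspect ratio.
        scale = size / max(width, height)
        return round(width * scale), round(height * scale)
    target_w, target_h = size
    if target_h is None:
        # size=(400, None): fixed width, height follows the aspect ratio.
        return target_w, round(height * target_w / width)
    if target_w is None:
        # Assumed to mirror the case above; not stated in the list itself.
        return round(width * target_h / height), target_h
    # size=(500, 500): both dimensions forced, aspect ratio not preserved.
    return target_w, target_h


letter_at_200_dpi = (1700, 2200)
print(expected_size(letter_at_200_dpi, 400))          # (309, 400)
print(expected_size(letter_at_200_dpi, (400, None)))  # (400, 518)
print(expected_size(letter_at_200_dpi, (500, 500)))   # (500, 500)
```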
Run python tests.py to get timings.