Security News
Introducing the Socket Python SDK
The initial version of the Socket Python SDK is now on PyPI, enabling developers to more easily interact with the Socket REST API in Python projects.
PDFMiner is a text extraction tool for PDF documents.
Warning: Starting from version 20191010, PDFMiner supports Python 3 only. For Python 2 support, check out pdfminer.six.
> pip install pdfminer
> pdf2txt.py samples/simple1.pdf
pdf2txt.py extracts all text that is rendered programmatically. It also extracts the corresponding location, font name, font size, and writing direction (horizontal or vertical) for each text segment. It does not recognize text in images. A password must be provided for restricted PDF documents.
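As a rough illustration of how the per-segment location, font name, and font size might be consumed, the sketch below parses XML such as `pdf2txt.py -t xml` could produce. The element and attribute names (`text`, `font`, `bbox`, `size`) and the sample document are assumptions made for this example, not a documented schema; compare them against real output from your installed version before relying on them.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample shaped like the XML output is ASSUMED to look;
# the tag and attribute names are not confirmed by this README.
SAMPLE_XML = """<pages>
<page id="1" bbox="0.000,0.000,612.000,792.000" rotate="0">
<textbox id="0" bbox="100.000,695.032,130.672,719.032">
<textline bbox="100.000,695.032,130.672,719.032">
<text font="Helvetica" bbox="100.000,695.032,117.328,719.032" size="24.000">H</text>
<text font="Helvetica" bbox="117.328,695.032,130.672,719.032" size="24.000">i</text>
</textline>
</textbox>
</page>
</pages>"""


def iter_text_segments(xml_text):
    """Yield (text, font, size, bbox) for each <text> element that has a font."""
    root = ET.fromstring(xml_text)
    for node in root.iter("text"):
        font = node.get("font")
        if font is None:
            continue  # skip elements that carry no font metadata
        # bbox is assumed to be "x0,y0,x1,y1" in page coordinates
        bbox = tuple(float(v) for v in node.get("bbox").split(","))
        yield node.text, font, float(node.get("size")), bbox


segments = list(iter_text_segments(SAMPLE_XML))
```

Each tuple in `segments` then holds one text segment together with its font name, size, and bounding box, which is usually enough to filter by font size or sort by position.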
> pdf2txt.py [-P password] [-o output] [-t text|html|xml|tag]
[-O output_dir] [-c encoding] [-s scale] [-R rotation]
[-Y normal|loose|exact] [-p pagenos] [-m maxpages]
[-S] [-C] [-n] [-A] [-V]
[-M char_margin] [-L line_margin] [-W word_margin]
[-F boxes_flow] [-d]
input.pdf ...
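The synopsis above can also be driven from a Python script. The following is a minimal sketch, not part of PDFMiner itself: `build_pdf2txt_command` and `run_pdf2txt` are hypothetical helper names, only flags shown in the synopsis are used, and it assumes that `pdf2txt.py` is on the `PATH` and writes to standard output when `-o` is omitted.

```python
import subprocess
from typing import List, Optional

OUTPUT_TYPES = ("text", "html", "xml", "tag")  # values listed for -t


def build_pdf2txt_command(
    input_pdf: str,
    output: Optional[str] = None,
    output_type: Optional[str] = None,
    password: Optional[str] = None,
    maxpages: Optional[int] = None,
    script: str = "pdf2txt.py",  # assumed to be on PATH
) -> List[str]:
    """Assemble an argument list using only flags from the usage synopsis."""
    if output_type is not None and output_type not in OUTPUT_TYPES:
        raise ValueError(f"unsupported output type: {output_type!r}")
    cmd = [script]
    if password is not None:
        cmd += ["-P", password]
    if output is not None:
        cmd += ["-o", output]
    if output_type is not None:
        cmd += ["-t", output_type]
    if maxpages is not None:
        cmd += ["-m", str(maxpages)]
    cmd.append(input_pdf)
    return cmd


def run_pdf2txt(cmd: List[str]) -> str:
    """Run the command; returns captured stdout (assumes no -o was given)."""
    result = subprocess.run(cmd, check=True, capture_output=True, text=True)
    return result.stdout
```

Passing the arguments as a list rather than a shell string avoids quoting problems with passwords or file names that contain spaces. For example, `run_pdf2txt(build_pdf2txt_command("samples/simple1.pdf"))` would invoke the same command as the quick-start line above.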
-P password
: PDF password.
-o output
: Output file name.
-t text|html|xml|tag
: Output type. (default: automatically inferred from the output file name.)
-O output_dir
: Output directory for extracted images.
-c encoding
: Output encoding. (default: utf-8)
-s scale
: Output scale.
-R rotation
: Rotates the page by the given number of degrees.
-Y normal|loose|exact
: Specifies the layout mode. (only for HTML output.)
-p pagenos
: Processes certain pages only.
-m maxpages
: Limits the maximum number of pages to process.
-S
: Strips control characters.
-C
: Disables resource caching.
-n
: Disables layout analysis.
-A
: Applies layout analysis to all text, including figures.
-V
: Automatically detects vertical writing.
-M char_margin
: Specifies the char margin.
-W word_margin
: Specifies the word margin.
-L line_margin
: Specifies the line margin.
-F boxes_flow
: Specifies the box flow ratio.
-d
: Turns on debug output.
dumppdf.py is used for debugging PDFs. It dumps all the internal contents in pseudo-XML format.
> dumppdf.py [-P password] [-a] [-p pageid] [-i objid]
[-o output] [-r|-b|-t] [-T] [-O directory] [-d]
input.pdf ...
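A similar sketch for `dumppdf.py` follows. The synopsis groups `-r|-b|-t` as alternatives, so the hypothetical helper below accepts at most one stream mode per call. The helper name, the mode labels, and the single-integer form of `objid` are assumptions made for this illustration.

```python
from typing import List, Optional

# Stream dump modes from the synopsis; "-r|-b|-t" is read as "at most one".
STREAM_MODE_FLAGS = {"raw": "-r", "binary": "-b", "text": "-t"}


def build_dumppdf_command(
    input_pdf: str,
    extract_all: bool = False,
    objid: Optional[int] = None,
    stream_mode: Optional[str] = None,
    output: Optional[str] = None,
    script: str = "dumppdf.py",  # assumed to be on PATH
) -> List[str]:
    """Assemble a dumppdf.py argument list from flags shown in the synopsis."""
    if stream_mode is not None and stream_mode not in STREAM_MODE_FLAGS:
        raise ValueError(f"unknown stream mode: {stream_mode!r}")
    cmd = [script]
    if extract_all:
        cmd.append("-a")
    if objid is not None:
        cmd += ["-i", str(objid)]
    if output is not None:
        cmd += ["-o", output]
    if stream_mode is not None:
        cmd.append(STREAM_MODE_FLAGS[stream_mode])
    cmd.append(input_pdf)
    return cmd
```

The resulting list can be passed to `subprocess.run` in the usual way. Taking the mode as a single keyword keeps callers from combining `-r`, `-b`, and `-t` by accident.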
-P password
: PDF password.
-a
: Extracts all objects.
-p pageid
: Extracts a Page object.
-i objid
: Extracts a certain object.
-o output
: Output file name.
-r
: Raw mode. Dumps the raw compressed/encoded streams.
-b
: Binary mode. Dumps the uncompressed/decoded streams.
-t
: Text mode. Dumps the streams in text format.
-T
: Tagged mode. Dumps the tagged contents.
-O output_dir
: Output directory for extracted streams.
FAQs
PDF parser and analyzer
We found that pdfminer demonstrated a healthy version release cadence and project activity because the last version was released less than a year ago. It has 1 open source maintainer collaborating on the project.