
PDFMiner is a text extraction tool for PDF documents.
Warning: Starting from version 20191010, PDFMiner supports Python 3 only. For Python 2 support, check out pdfminer.six.
> pip install pdfminer
> pdf2txt.py samples/simple1.pdf
pdf2txt.py extracts all text that is rendered programmatically, along with the location, font name, font size, and writing direction (horizontal or vertical) of each text segment. It does not recognize text in images. Restricted PDF documents require a password.
> pdf2txt.py [-P password] [-o output] [-t text|html|xml|tag]
[-O output_dir] [-c encoding] [-s scale] [-R rotation]
[-Y normal|loose|exact] [-p pagenos] [-m maxpages]
[-S] [-C] [-n] [-A] [-V]
[-M char_margin] [-L line_margin] [-W word_margin]
[-F boxes_flow] [-d]
input.pdf ...
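When pdf2txt.py is driven from a script, the synopsis above can be assembled as an argument list. The following is a minimal sketch, not part of PDFMiner itself: the helper names `build_pdf2txt_args` and `run_pdf2txt` are hypothetical, only a handful of the flags from the synopsis are covered, and it assumes pdf2txt.py is installed and on the PATH.

```python
import subprocess

# Output types listed in the synopsis: -t text|html|xml|tag
OUTPUT_TYPES = ("text", "html", "xml", "tag")


def build_pdf2txt_args(input_pdf, output=None, output_type=None,
                       password=None, maxpages=None, no_layout=False):
    """Build a pdf2txt.py argument list (hypothetical helper, not a PDFMiner API)."""
    args = ["pdf2txt.py"]
    if password is not None:
        args += ["-P", password]          # -P password
    if output is not None:
        args += ["-o", output]            # -o output
    if output_type is not None:
        if output_type not in OUTPUT_TYPES:
            raise ValueError("unsupported output type: %r" % (output_type,))
        args += ["-t", output_type]       # -t text|html|xml|tag
    if maxpages is not None:
        args += ["-m", str(maxpages)]     # -m maxpages
    if no_layout:
        args.append("-n")                 # -n disables layout analysis
    args.append(input_pdf)
    return args


def run_pdf2txt(*args, **kwargs):
    """Run pdf2txt.py; assumes the script is executable and on the PATH."""
    return subprocess.run(build_pdf2txt_args(*args, **kwargs), check=True)


if __name__ == "__main__":
    cmd = build_pdf2txt_args("samples/simple1.pdf", output="simple1.html",
                             output_type="html", maxpages=1)
    print(" ".join(cmd))
```

Passing the arguments as a list, rather than a single shell string, avoids quoting problems when a file name or password contains spaces.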
-P password
: PDF password.
-o output
: Output file name.
-t text|html|xml|tag
: Output type. (default: automatically inferred from the output file name.)
-O output_dir
: Output directory for extracted images.
-c encoding
: Output encoding. (default: utf-8)
-s scale
: Output scale.
-R rotation
: Rotates the page by the given number of degrees.
-Y normal|loose|exact
: Specifies the layout mode. (only for HTML output.)
-p pagenos
: Processes certain pages only.
-m maxpages
: Limits the maximum number of pages to process.
-S
: Strips control characters.
-C
: Disables resource caching.
-n
: Disables layout analysis.
-A
: Applies layout analysis to all texts, including figures.
-V
: Automatically detects vertical writing.
-M char_margin
: Specifies the char margin.
-W word_margin
: Specifies the word margin.
-L line_margin
: Specifies the line margin.
-F boxes_flow
: Specifies the box flow ratio.
-d
: Turns on debug output.
dumppdf.py is used for debugging PDFs. It dumps all the internal contents in pseudo-XML format.
> dumppdf.py [-P password] [-a] [-p pageid] [-i objid]
[-o output] [-r|-b|-t] [-T] [-O directory] [-d]
input.pdf ...
-P password
: PDF password.
-a
: Extracts all objects.
-p pageid
: Extracts a Page object.
-i objid
: Extracts a certain object.
-o output
: Output file name.
-r
: Raw mode. Dumps the raw compressed/encoded streams.
-b
: Binary mode. Dumps the uncompressed/decoded streams.
-t
: Text mode. Dumps the streams in text format.
-T
: Tagged mode. Dumps the tagged contents.
-O output_dir
: Output directory for extracted streams.
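The dumppdf.py synopsis groups -r, -b and -t as alternatives, so a wrapper script should pass at most one of them. The following is a minimal sketch under that reading: the helper name `build_dumppdf_args` and the mode labels "raw", "binary" and "text" are hypothetical and not part of PDFMiner, and only some of the options are covered.

```python
# Stream dump modes from the synopsis: [-r|-b|-t]
STREAM_MODES = {"raw": "-r", "binary": "-b", "text": "-t"}


def build_dumppdf_args(input_pdf, all_objects=False, objid=None,
                       stream_mode=None, output=None, password=None):
    """Build a dumppdf.py argument list (hypothetical helper, not a PDFMiner API)."""
    args = ["dumppdf.py"]
    if password is not None:
        args += ["-P", password]              # -P password
    if all_objects:
        args.append("-a")                     # -a extracts all objects
    if objid is not None:
        args += ["-i", str(objid)]            # -i objid extracts one object
    if stream_mode is not None:
        if stream_mode not in STREAM_MODES:
            raise ValueError("unknown stream mode: %r" % (stream_mode,))
        args.append(STREAM_MODES[stream_mode])  # exactly one of -r, -b, -t
    if output is not None:
        args += ["-o", output]                # -o output
    args.append(input_pdf)
    return args


if __name__ == "__main__":
    # Dump every object, with streams decoded, to a file.
    print(" ".join(build_dumppdf_args("input.pdf", all_objects=True,
                                      stream_mode="binary", output="dump.xml")))
```

Taking the mode as a single value instead of three booleans makes it impossible to request two dump modes at once.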