
pdfminer.six


PDF parser and analyzer



We fathom PDF

Pdfminer.six is a community-maintained fork of the original PDFMiner. It is a tool for extracting information from PDF documents, with a focus on getting and analyzing text data. Pdfminer.six extracts the text of a page directly from the source code of the PDF. It can also be used to get the exact location, font, or color of the text.
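
As a rough illustration of reading text locations, the sketch below walks the layout objects returned by extract_pages from pdfminer.high_level and reports each text element with its bounding box. The helper names iter_text_boxes and print_text_locations are made up for this example, and duck typing on get_text() is used here in place of an isinstance check against the layout classes so that the helper also works on stand-in objects.

```python
from typing import Iterable, Iterator, Tuple

BBox = Tuple[float, float, float, float]


def iter_text_boxes(
    pages: Iterable[Iterable[object]],
) -> Iterator[Tuple[int, BBox, str]]:
    """Yield (page_number, bbox, text) for each top-level element that carries text.

    Hypothetical helper: `pages` is any iterable of pages, where a page is
    an iterable of layout elements, such as the result of extract_pages().
    """
    for page_number, page in enumerate(pages, start=1):
        for element in page:
            # Text elements expose get_text() and a bbox tuple (x0, y0, x1, y1).
            # Elements without get_text(), such as lines or rectangles, are skipped.
            if hasattr(element, "get_text"):
                yield page_number, element.bbox, element.get_text().strip()


def print_text_locations(pdf_path: str) -> None:
    """Print page number, bounding box, and text for each text element in a PDF."""
    # Imported here so iter_text_boxes stays usable without the library installed.
    from pdfminer.high_level import extract_pages

    for page_number, bbox, text in iter_text_boxes(extract_pages(pdf_path)):
        print(page_number, bbox, repr(text))
```

Calling print_text_locations("example.pdf") would list one line per text element. Font details are kept on the individual character objects nested inside the text elements, so reading them would take one more level of iteration than shown here.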

It is built in a modular way, so each component of pdfminer.six can be replaced easily. You can implement your own interpreter or rendering device that uses the power of pdfminer.six for purposes other than text analysis.

Check out the full documentation on Read the Docs.

Features

  • Written entirely in Python.
  • Parse, analyze, and convert PDF documents.
  • Extract content as text, images, HTML or hOCR.
  • PDF-1.7 specification support (well, almost).
  • CJK languages and vertical writing scripts support.
  • Various font types (Type1, TrueType, Type3, and CID) support.
  • Support for extracting images (JPG, JBIG2, Bitmaps).
  • Support for various compressions (ASCIIHexDecode, ASCII85Decode, LZWDecode, FlateDecode, RunLengthDecode, CCITTFaxDecode).
  • Support for RC4 and AES encryption.
  • Support for AcroForm interactive form extraction.
  • Table of contents extraction.
  • Tagged contents extraction.
  • Automatic layout analysis.

How to use

  • Install Python 3.6 or newer.

  • Install pdfminer.six.

    pip install pdfminer.six

  • (Optionally) install extra dependencies for extracting images.

    pip install 'pdfminer.six[image]'

  • Use the command-line interface to extract text from a PDF.

    pdf2txt.py example.pdf

  • Or use it with Python.

    from pdfminer.high_level import extract_text

    text = extract_text("example.pdf")
    print(text)
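
Building on the extract_text call above, the following sketch splits the returned string into one string per page. It rests on an assumption that should be checked against the installed release: that the text output ends every page with a form feed character ("\f"). The helper names split_pages and extract_text_per_page are invented for this example.

```python
from typing import List


def split_pages(text: str) -> List[str]:
    """Split extracted text into per-page strings on form feed characters.

    Assumption: each page in the extracted text ends with "\f".
    """
    pages = text.split("\f")
    # A trailing form feed leaves one empty piece at the end; drop it.
    if pages and pages[-1] == "":
        pages.pop()
    return pages


def extract_text_per_page(pdf_path: str) -> List[str]:
    """Return the text of each page of a PDF as a separate string."""
    # Imported here so split_pages stays usable without the library installed.
    from pdfminer.high_level import extract_text

    return split_pages(extract_text(pdf_path))
```

With this in place, extract_text_per_page("example.pdf")[0] would hold the text of the first page, provided the form feed assumption holds.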

Contributing

Be sure to read the contribution guidelines.

Acknowledgement

This repository includes code from pyHanko; the original license has been included here.
