
invoice2data
A command line tool and Python library to support your accounting process.
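It extracts text from PDF files using one of several techniques: pdftotext, text, ocrmypdf, pdfminer, pdfplumber, or OCR with tesseract or gvision (Google Cloud Vision).
With the flexible template system you can, among other things, extract invoice line items using the lines plugin developed by Holger Brunn.
Go from PDF files to this: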
{'date': (2014, 5, 7), 'invoice_number': '30064443', 'amount': 34.73, 'desc': 'Invoice 30064443 from QualityHosting', 'lines': [{'price': 42.0, 'desc': u'Small Business StandardExchange 2010\nGrundgeb\xfchr pro Einheit\nDienst: OUDJQ_office\n01.05.14-31.05.14\n', 'pos': u'7', 'qty': 1.0}]}
{'date': (2014, 6, 4), 'invoice_number': 'EUVINS1-OF5-DE-120725895', 'amount': 35.24, 'desc': 'Invoice EUVINS1-OF5-DE-120725895 from Amazon EU'}
{'date': (2014, 8, 3), 'invoice_number': '42183017', 'amount': 4.11, 'desc': 'Invoice 42183017 from Amazon Web Services'}
{'date': (2015, 1, 28), 'invoice_number': '12429647', 'amount': 101.0, 'desc': 'Invoice 12429647 from Envato'}
flowchart LR
InvoiceFile[fa:fa-file-invoice Invoicefile\n\npdf\nimage\ntext] --> Input-module(Input Module\n\npdftotext\ntext\npdfminer\npdfplumber\ntesseract\ngvision)
Input-module --> |Extracted Text| C{keyword\nmatching}
Invoice-Templates[(fa:fa-file-lines Invoice Templates)] --> C{keyword\nmatching}
C --> |Extracted Text + fa:fa-file-circle-check Template| E(Template Processing\n apply options from template\nremove accents, replaces etc...)
E --> |Optimized String|Plugins&Parsers(Call plugins + parsers)
subgraph Plugins&Parsers
direction BT
tables[fa:fa-table tables] ~~~ lines[fa:fa-grip-lines lines]
lines ~~~ regex[fa:fa-code regex]
regex ~~~ static[fa:fa-check static]
end
Plugins&Parsers --> |output| result[result\nfa:fa-file-csv,\njson,\nXML]
click Invoice-Templates https://github.com/invoice-x/invoice2data/blob/master/TUTORIAL.md
click result https://github.com/invoice-x/invoice2data#usage
click Input-module https://github.com/invoice-x/invoice2data#installation-of-input-modules
click E https://github.com/invoice-x/invoice2data/blob/master/TUTORIAL.md#options
click tables https://github.com/invoice-x/invoice2data/blob/master/TUTORIAL.md#tables
click lines https://github.com/invoice-x/invoice2data/blob/master/TUTORIAL.md#lines
click regex https://github.com/invoice-x/invoice2data/blob/master/TUTORIAL.md#regex
click static https://github.com/invoice-x/invoice2data/blob/master/TUTORIAL.md#parser-static
If possible, get the latest xpdf/poppler-utils version. It is included with macOS Homebrew, Debian and Ubuntu. Without it, pdftotext won't parse tables in PDFs correctly.
Install invoice2data using pip
pip install invoice2data
A tesseract wrapper is included and runs in auto language mode: it tests your input files against the languages installed on your system. To use it, tesseract and imagemagick need to be installed. tesseract supports multiple OCR engine modes; by default, the engine available on the system is used.
Languages: tesseract-ocr recognizes more than 100 languages. Linux users can often find packages that provide language packs:
# Display a list of all Tesseract language packs
apt-cache search tesseract-ocr
# Debian/Ubuntu users
apt-get install tesseract-ocr-chi-sim # Example: Install Chinese Simplified language pack
# Arch Linux users
pacman -S tesseract-data-eng tesseract-data-deu # Example: Install the English and German language packs
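To see which language packs tesseract itself can find, you can ask it directly with `tesseract --list-langs`. The sketch below wraps that call in Python. The helper names `parse_langs` and `installed_tesseract_langs` are made up for illustration and are not part of invoice2data, and the parsing assumes the usual output layout of one header line followed by one language code per line.

```python
import shutil
import subprocess


def parse_langs(output: str) -> list[str]:
    """Parse the text printed by `tesseract --list-langs`.

    Assumes the first non-empty line is a header ("List of available
    languages ...") and every following non-empty line is a language code.
    """
    lines = [line.strip() for line in output.splitlines() if line.strip()]
    return sorted(lines[1:])


def installed_tesseract_langs() -> list[str]:
    """Return installed language codes, or [] if tesseract is not on PATH."""
    if shutil.which("tesseract") is None:
        return []
    proc = subprocess.run(
        ["tesseract", "--list-langs"],
        capture_output=True,
        text=True,
        check=False,
    )
    # Depending on the tesseract release, the list may arrive on stdout
    # or on stderr, so fall back to stderr when stdout is empty.
    return parse_langs(proc.stdout or proc.stderr)


if __name__ == "__main__":
    print(installed_tesseract_langs())
```

An empty list means either that tesseract is not installed or that no language data was found.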
Basic usage. Process PDF files and write result to CSV.
invoice2data invoice.pdf
invoice2data invoice.txt
invoice2data *.pdf
Choose any of the following input readers:
invoice2data --input-reader pdftotext invoice.pdf
invoice2data --input-reader text invoice.txt
invoice2data --input-reader tesseract invoice.pdf
invoice2data --input-reader pdfminer invoice.pdf
invoice2data --input-reader pdfplumber invoice.pdf
invoice2data --input-reader ocrmypdf invoice.pdf
invoice2data --input-reader gvision invoice.pdf (needs GOOGLE_APPLICATION_CREDENTIALS env var)
Choose any of the following output formats:
invoice2data --output-format csv invoice.pdf
invoice2data --output-format json invoice.pdf
invoice2data --output-format xml invoice.pdf
Save output file with custom name or a specific folder
invoice2data --output-format csv --output-name myinvoices/invoices.csv invoice.pdf
Note: you must specify --output-format in order to use --output-name.
Specify a folder with YAML templates (e.g. for your suppliers).
invoice2data --template-folder ACME-templates invoice.pdf
Only use your own templates and exclude built-ins
invoice2data --exclude-built-in-templates --template-folder ACME-templates invoice.pdf
Process a folder of invoices and copy the renamed invoices to a new folder.
invoice2data --copy new_folder folder_with_invoices/*.pdf
Process a single file and dump the whole file for debugging (useful when adding new templates in templates.py)
invoice2data --debug my_invoice.pdf
Recognize test invoices:
invoice2data invoice2data/test/pdfs/* --debug
You can easily add invoice2data to your own Python scripts as a library.
from invoice2data import extract_data
result = extract_data('path/to/my/file.pdf')
Using in-house templates
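from invoice2data import extract_data
from invoice2data.extract.loader import read_templates
# Load your own templates from a folder, then pass them to extract_data
templates = read_templates('/path/to/your/templates/')
filename = 'path/to/my/file.pdf'
result = extract_data(filename, templates=templates)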
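The extracted data is a plain dictionary like the ones shown at the top of this README, so it can be post-processed with the standard library. The sketch below is an illustration rather than part of invoice2data: `results_to_csv` and `normalise_date` are hypothetical helpers, the sample dictionary is copied from the example output above, and dates are assumed to arrive either as `(year, month, day)` tuples (as in that output) or as date objects. Falsy results are skipped, on the assumption that a file without a matching template yields no data.

```python
import csv
import io
from datetime import date

FIELDS = ["date", "invoice_number", "amount", "desc"]


def normalise_date(value):
    """Turn a (year, month, day) tuple or a date/datetime into ISO text."""
    if value is None:
        return ""
    if isinstance(value, tuple):
        value = date(*value)
    return value.isoformat() if hasattr(value, "isoformat") else str(value)


def results_to_csv(results) -> str:
    """Write the chosen fields of each result dict as CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=FIELDS, extrasaction="ignore", lineterminator="\n"
    )
    writer.writeheader()
    for result in results:
        if not result:  # skip files for which nothing was extracted
            continue
        row = dict(result)
        row["date"] = normalise_date(row.get("date"))
        writer.writerow(row)
    return buffer.getvalue()


sample = {
    "date": (2014, 8, 3),
    "invoice_number": "42183017",
    "amount": 4.11,
    "desc": "Invoice 42183017 from Amazon Web Services",
}
print(results_to_csv([sample]))
```

Fields not listed in `FIELDS` (such as `lines`) are ignored here; extend the list if you need them.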
See invoice2data/extract/templates for existing templates. Just extend
the list to add your own. If deployed by a bigger organisation, there
should be an interface to edit templates for new suppliers. 80-20 rule.
For a short tutorial on how to add new templates, see TUTORIAL.md.
Templates are written in YAML or JSON. Each template defines one or more keywords used to select it, optionally one or more exclude_keywords to narrow the match further, and a regular expression for each field to be extracted. A field can also be a static value, such as the full company name.
Template files are tried in alphabetical order.
We may extend them to feature options to be used during invoice processing.
Example:
issuer: Amazon Web Services, Inc.
keywords:
- Amazon Web Services
exclude_keywords:
- San Jose
fields:
  amount: TOTAL AMOUNT DUE ON.*\$(\d+\.\d+)
  amount_untaxed: TOTAL AMOUNT DUE ON.*\$(\d+\.\d+)
  date: Invoice Date:\s+([a-zA-Z]+ \d+ , \d+)
  invoice_number: Invoice Number:\s+(\d+)
  partner_name: (Amazon Web Services, Inc\.)
options:
  remove_whitespace: false
  currency: HKD
  date_formats:
    - '%d/%m/%Y'
lines:
  start: Detail
  end: \* May include estimated US sales tax
  first_line: ^ (?P<description>\w+.*)\$(?P<price_unit>\d+\.\d+)
  line: (.*)\$(\d+\.\d+)
  skip_line: Note
  last_line: VAT \*\*
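To make the roles of `keywords`, `exclude_keywords` and `fields` concrete, here is a simplified sketch of the matching logic using only Python's `re` module. It is not the library's implementation: the functions `template_matches` and `extract_fields`, the plain substring test for keywords, and the sample invoice text are all assumptions made for illustration.

```python
import re

# A cut-down copy of the template above, written as a Python dict.
TEMPLATE = {
    "issuer": "Amazon Web Services, Inc.",
    "keywords": ["Amazon Web Services"],
    "exclude_keywords": ["San Jose"],
    "fields": {
        "amount": r"TOTAL AMOUNT DUE ON.*\$(\d+\.\d+)",
        "invoice_number": r"Invoice Number:\s+(\d+)",
    },
}

# Made-up invoice text, standing in for the output of an input reader.
SAMPLE_TEXT = """Amazon Web Services, Inc.
Invoice Number:  42183017
TOTAL AMOUNT DUE ON August 3 , 2014   $4.11
"""


def template_matches(template, text):
    """All keywords must occur in the text and no exclude_keyword may."""
    has_all = all(k in text for k in template["keywords"])
    has_excluded = any(k in text for k in template.get("exclude_keywords", []))
    return has_all and not has_excluded


def extract_fields(template, text):
    """Return the first capture group of each field regex that matches."""
    found = {"issuer": template["issuer"]}
    for name, pattern in template["fields"].items():
        match = re.search(pattern, text)
        if match:
            found[name] = match.group(1)
    return found


if template_matches(TEMPLATE, SAMPLE_TEXT):
    print(extract_fields(TEMPLATE, SAMPLE_TEXT))
```

The captured values are still strings at this point; converting them to numbers or dates (for example with the `date_formats` option) is a separate step that this sketch leaves out.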
The lines package has multiple settings; the example above uses start, end, first_line, line, skip_line and last_line.
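As a rough mental model of how `start`, `end`, `line` and `skip_line` interact, the sketch below scans the text between the start and end patterns and applies the line pattern to each row. It is an illustration under stated assumptions, not the plugin's code: `extract_lines` is a hypothetical function, the named groups `description` and `price_unit` are borrowed from the `first_line` pattern in the example, the `line` pattern is a variant written for this sample, and the sample text is invented.

```python
import re


def extract_lines(text, start, end, line, skip_line=None):
    """Collect named-group matches for rows between `start` and `end`."""
    rows = []
    inside = False
    for raw in text.splitlines():
        if not inside:
            # Ignore everything until the start pattern is seen.
            inside = re.search(start, raw) is not None
            continue
        if re.search(end, raw):
            break
        if skip_line and re.search(skip_line, raw):
            continue
        match = re.search(line, raw)
        if match:
            rows.append(match.groupdict())
    return rows


SAMPLE = """Summary
Detail
Compute hours   $3.00
Note: usage is rounded up
Storage   $1.11
* May include estimated US sales tax
"""

rows = extract_lines(
    SAMPLE,
    start=r"Detail",
    end=r"\* May include estimated US sales tax",
    line=r"(?P<description>\w+.*?)\s*\$(?P<price_unit>\d+\.\d+)",
    skip_line=r"Note",
)
print(rows)
```

Rows that match `skip_line` are dropped before the `line` pattern is tried, which is why the "Note" row does not appear in the result.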
Warning: invoice2data uses a YAML templating system. The YAML templates are loaded with pyyaml, which is a pure Python implementation and therefore rather slow. As an alternative, JSON templates can be used; JSON is natively supported by Python.
Performance with YAML templates can be increased greatly (about 10x) by using libyaml.
It can be installed on most distributions with:
sudo apt-get install libyaml-dev
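Whether the C bindings are actually in use depends on how PyYAML was built on your machine. A quick check, assuming your PyYAML version exposes the `__with_libyaml__` flag (the function name `yaml_backend` is just for illustration):

```python
def yaml_backend() -> str:
    """Report which YAML parser PyYAML would use on this machine."""
    try:
        import yaml
    except ImportError:
        return "pyyaml-not-installed"
    # PyYAML sets __with_libyaml__ when its C extension is importable.
    if getattr(yaml, "__with_libyaml__", False):
        return "libyaml"
    return "pure-python"


print(yaml_backend())
```

If this reports `pure-python` after installing libyaml-dev, PyYAML may need to be reinstalled so that it is built against the library.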
If you are interested in improving this project, have a look at our developer guide to get you started quickly.