pdftotree
=========
|License| |Stars| |PyPI| |Version| |Issues| |CI-CD| |Codecov| |CodeStyle|
WARNING: pdftotree is experimental code and is NOT stable. It is not integrated with or supported by Fonduer.
Fonduer_ performs knowledge base construction from richly formatted data such
as tables. A crucial step in this process is the construction of the
hierarchical tree of context objects such as text blocks, figures, tables, etc.
The system currently uses PDF to HTML conversion provided by Adobe Acrobat.
However, Adobe Acrobat is not an open source tool, which may be inconvenient
for Fonduer users.
This package is the result of building our own module as a replacement for
Adobe Acrobat. Several open source tools are available for PDF-to-HTML
conversion, but these tools do not preserve the cell structure in a table. Our
goal in this project is to develop a tool that extracts text, figures, and
tables from a PDF document and returns them in an easily consumable format.
Up to v0.4.1, pdftotree's output was formatted in its own "HTML-like" format.
From v0.5.0, it conforms to hOCR_, an open-standard format for OCR results.
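
In hOCR, each element carries its properties in the ``title`` attribute, including a ``bbox x0 y0 x1 y1`` entry. As a minimal sketch of consuming such output with the Python standard library, the snippet below collects bounding boxes from a small hand-written sample. The class ``BBoxCollector`` and the sample markup are hypothetical; which hOCR classes pdftotree actually emits is not assumed here.

.. code:: python

    from html.parser import HTMLParser


    class BBoxCollector(HTMLParser):
        """Collect (class, bbox) pairs from elements with an hOCR ``title``."""

        def __init__(self):
            super().__init__()
            self.boxes = []

        def handle_starttag(self, tag, attrs):
            attributes = dict(attrs)
            title = attributes.get("title") or ""
            # hOCR packs properties into ``title``, separated by semicolons.
            for prop in title.split(";"):
                parts = prop.split()
                if len(parts) == 5 and parts[0] == "bbox":
                    bbox = tuple(int(value) for value in parts[1:])
                    self.boxes.append((attributes.get("class"), bbox))


    # Hand-written sample for illustration, not real pdftotree output.
    SAMPLE_HOCR = """
    <div class="ocr_page" title="bbox 0 0 612 792; ppageno 0">
      <span class="ocrx_word" title="bbox 72 90 120 104">Hello</span>
    </div>
    """

    collector = BBoxCollector()
    collector.feed(SAMPLE_HOCR)

After ``feed`` returns, ``collector.boxes`` holds one entry per element that declared a ``bbox``.
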
Dependencies
------------

pdftotree depends on the following native libraries:

- ImageMagick 6+ (for Wand)
- Java 8+ (for tabula-py)
Installation
------------

To install this package from PyPI::

    $ pip install pdftotree
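Usage
-----

pdftotree as a Python package
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. code:: python

    import pdftotree

    # pdf_file is the path to the input PDF; the other arguments show their defaults.
    pdftotree.parse(pdf_file, html_path=None, model_type=None, model_path=None, visualize=False)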
pdftotree
~~~~~~~~~
This is the primary command-line utility provided with this Python package.
This takes a PDF file as input and produces an hOCR file as output::

    usage: pdftotree [options] pdf_file

    Convert PDF into hOCR.

    positional arguments:
      pdf_file              Path to input PDF file.

    optional arguments:
      -h, --help            show this help message and exit
      -mt {vision,ml,None}, --model_type {vision,ml,None}
                            Model type to use. None (default) for heuristics
                            approach.
      -m MODEL_PATH, --model_path MODEL_PATH
                            Pretrained model, generated by extract_tables tool
      -o OUTPUT, --output OUTPUT
                            Path to output hOCR file. If not given, it will be
                            printed to stdout.
      -V, --visualize       Whether to output visualization images
      -v, --verbose         Output INFO level logging.
      -vv, --veryverbose    Output DEBUG level logging.
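
When driving the command-line tool from a script, the options above can be assembled into an argument list. The following is a minimal sketch that uses only the flags listed in the help text; the helper ``build_pdftotree_command`` and the file names are hypothetical.

.. code:: python

    def build_pdftotree_command(
        pdf_file,
        output=None,
        model_type=None,
        model_path=None,
        visualize=False,
        verbose=False,
    ):
        """Build an argv list for the ``pdftotree`` command-line tool."""
        command = ["pdftotree"]
        if model_type is not None:
            command += ["-mt", model_type]  # "vision" or "ml"
        if model_path is not None:
            command += ["-m", model_path]
        if output is not None:
            command += ["-o", output]  # omit to print hOCR to stdout
        if visualize:
            command.append("-V")
        if verbose:
            command.append("-v")
        command.append(pdf_file)
        return command


    command = build_pdftotree_command("paper.pdf", output="paper.hocr", verbose=True)
    # Where pdftotree is installed, run it with:
    #     import subprocess; subprocess.run(command, check=True)

Passing a list rather than a single string avoids shell quoting problems with file names that contain spaces.
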
extract\_tables
~~~~~~~~~~~~~~~
This tool trains a machine-learning model to extract tables. The output model
can be used as an input to ``pdftotree``::

    usage: extract_tables [-h] [--mode MODE] --model-path MODEL_PATH
                          [--train-pdf TRAIN_PDF] --test-pdf TEST_PDF
                          [--gt-train GT_TRAIN] --gt-test GT_TEST --datapath
                          DATAPATH [--iou-thresh IOU_THRESH] [-v] [-vv]

    Script to extract tables bounding boxes from PDF files using machine learning.
    If `model.pkl` is saved in the model-path, the pickled model will be used for
    prediction. Otherwise the model will be retrained. If --mode is test (by
    default), the script will create a .bbox file containing the tables for the
    pdf documents listed in the file --test-pdf. If --mode is dev, the script will
    also extract ground truth labels for the test data and compute statistics.

    optional arguments:
      -h, --help            show this help message and exit
      --mode MODE           Usage mode dev or test, default is test
      --model-path MODEL_PATH
                            Path to the model. If the file exists, it will be
                            used. Otherwise, a new model will be trained.
      --train-pdf TRAIN_PDF
                            List of pdf file names used for training. These files
                            must be saved in the --datapath directory. Required if
                            no pretrained model is provided.
      --test-pdf TEST_PDF   List of pdf file names used for testing. These files
                            must be saved in the --datapath directory.
      --gt-train GT_TRAIN   Ground truth train tables. Required if no pretrained
                            model is provided.
      --gt-test GT_TEST     Ground truth test tables.
      --datapath DATAPATH   Path to directory containing the input documents.
      --iou-thresh IOU_THRESH
                            Intersection over union threshold to remove duplicate
                            tables
      -v                    Output INFO level logging
      -vv                   Output DEBUG level logging
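
The ``--iou-thresh`` option refers to intersection over union (IoU): the overlap area of two boxes divided by the area of their union. As a minimal sketch, assuming boxes are given as ``(top, left, bottom, right)``, IoU can be computed as below. The function name ``iou`` is hypothetical, and this is not necessarily how ``extract_tables`` computes it internally.

.. code:: python

    def iou(box_a, box_b):
        """Return intersection over union of two (top, left, bottom, right) boxes."""
        top_a, left_a, bottom_a, right_a = box_a
        top_b, left_b, bottom_b, right_b = box_b

        # Height and width of the overlap; zero when the boxes do not intersect.
        inter_height = max(0, min(bottom_a, bottom_b) - max(top_a, top_b))
        inter_width = max(0, min(right_a, right_b) - max(left_a, left_b))
        intersection = inter_height * inter_width

        area_a = (bottom_a - top_a) * (right_a - left_a)
        area_b = (bottom_b - top_b) * (right_b - left_b)
        union = area_a + area_b - intersection
        return intersection / union if union > 0 else 0.0

Two detections whose IoU exceeds the threshold overlap heavily, so one of them can be treated as a duplicate.
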
PDF List Format
^^^^^^^^^^^^^^^

The PDF list is a plain-text file with a single filename on each line. For example::

    1-s2.0-S000925411100369X-main.pdf
    1-s2.0-S0009254115301030-main.pdf
    1-s2.0-S0012821X12005717-main.pdf
    1-s2.0-S0012821X15007487-main.pdf
    1-s2.0-S0016699515000601-main.pdf
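
Because the listed files must live in the ``--datapath`` directory, a list can be resolved to full paths with a few lines of Python. This is a minimal sketch; the helper ``resolve_pdf_list`` is hypothetical, and skipping blank lines is an assumption rather than documented behavior.

.. code:: python

    from pathlib import Path


    def resolve_pdf_list(list_text, datapath):
        """Map each non-empty line of a PDF list to a path under ``datapath``."""
        return [
            Path(datapath) / line.strip()
            for line in list_text.splitlines()
            if line.strip()
        ]


    pdf_paths = resolve_pdf_list(
        "1-s2.0-S000925411100369X-main.pdf\n1-s2.0-S0009254115301030-main.pdf\n",
        "data/paleo/documents",
    )
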
Ground Truth File Format
^^^^^^^^^^^^^^^^^^^^^^^^

The ground truth is formatted to mirror the PDF list. That is, the first line
of the ground truth file provides the labels for the first document in the
corresponding PDF list. Labels take the form of semicolon-separated tuples
containing the values ``(page_num, page_width, page_height, top, left,
bottom, right)``. For example::
(10, 696, 951, 634, 366, 832, 653);(14, 696, 951, 720, 62, 819, 654);(4, 696, 951, 152, 66, 813, 654);(7, 696, 951, 415, 57, 833, 647);(8, 696, 951, 163, 370, 563, 652)
(11, 713, 951, 97, 47, 204, 676);(11, 713, 951, 261, 45, 357, 673);(3, 713, 951, 110, 44, 355, 676);(8, 713, 951, 763, 55, 903, 687)
(5, 672, 951, 88, 57, 203, 578);(5, 672, 951, 593, 60, 696, 579)
(5, 718, 951, 131, 382, 403, 677)
(13, 713, 951, 119, 56, 175, 364);(13, 713, 951, 844, 57, 902, 363);(14, 713, 951, 109, 365, 164, 671);(8, 713, 951, 663, 46, 890, 672)
One method to label these tables is to use DocumentAnnotation_, which allows
you to select table regions in your web browser and produces the bounding box
file.
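
A ground truth line in the format shown above can be parsed as follows. This is a minimal sketch: the names ``TableLabel`` and ``parse_ground_truth_line`` are hypothetical, and integer values are assumed because every value in the example is an integer.

.. code:: python

    import re
    from typing import List, NamedTuple


    class TableLabel(NamedTuple):
        page_num: int
        page_width: int
        page_height: int
        top: int
        left: int
        bottom: int
        right: int


    # Matches the contents of each parenthesized tuple on a line.
    _TUPLE_RE = re.compile(r"\(([^()]*)\)")


    def parse_ground_truth_line(line: str) -> List[TableLabel]:
        """Parse one line of semicolon-separated 7-tuples into labels."""
        labels = []
        for body in _TUPLE_RE.findall(line):
            values = [int(value) for value in body.split(",")]
            if len(values) != 7:
                raise ValueError(f"expected 7 values, got {len(values)}: {body!r}")
            labels.append(TableLabel(*values))
        return labels


    labels = parse_ground_truth_line(
        "(5, 672, 951, 88, 57, 203, 578);(5, 672, 951, 593, 60, 696, 579)"
    )

Since line *i* of the ground truth file describes document *i* of the PDF list, the two files can be read in parallel to pair each document with its labels.
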
Example Dataset: Paleontological Papers
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
A full set of documents and ground truth labels can be downloaded here:
PaleoDocs_. You can train a machine-learning model to extract table regions by
downloading this dataset and extracting it into a directory named ``data`` and
then running the command below. Double check that the paths in the command
match wherever you have downloaded the data::
$ extract_tables --train-pdf data/paleo/ml/train.pdf.list.paleo.not.scanned --gt-train data/paleo/ml/gt.train --test-pdf data/paleo/ml/test.pdf.list.paleo.not.scanned --gt-test data/paleo/ml/gt.test --datapath data/paleo/documents/ --model-path data/model.pkl
The resulting model of this example command would be saved as
``data/model.pkl``.
For Developers
--------------
We are following `Semantic Versioning 2.0.0 <https://semver.org/>`__
conventions. The maintainers will create a git tag for each release and
increment the version number found in the `version file`_ accordingly. We
deploy tags to PyPI automatically using GitHub Actions.
Tests
~~~~~
To test changes in the package, install it in `editable mode`_ locally in
your virtualenv by running::

    $ make dev

This will also install all the tools we use to enforce code style.

Then you can run our tests::

    $ make test
Release
~~~~~~~
Follow the steps below to release:

1. Make commits with the following changes:

   1. Update the CHANGELOG
   2. Change the version at ``pdftotree/_version.py`` to ``0.X.Y``.

2. Submit the commits as a pull request
3. Once the pull request is merged, add a tag ``v0.X.Y`` (don't forget the "v" at the beginning) and push it
4. Pushing the tag triggers a GitHub Actions workflow that

   1. Creates a pre-release on GitHub
   2. Publishes a package to PyPI

5. Edit the pre-release and release it
6. Increment the version to ``0.X.(Y+1)+dev``

.. |License| image:: https://img.shields.io/github/license/HazyResearch/pdftotree.svg
   :target: https://github.com/HazyResearch/pdftotree/blob/master/LICENSE
.. |Stars| image:: https://img.shields.io/github/stars/HazyResearch/pdftotree.svg
   :target: https://github.com/HazyResearch/pdftotree/stargazers
.. |PyPI| image:: https://img.shields.io/pypi/v/pdftotree.svg
   :target: https://pypi.python.org/pypi/pdftotree
.. |Version| image:: https://img.shields.io/pypi/pyversions/pdftotree.svg
   :target: https://pypi.python.org/pypi/pdftotree
.. |Issues| image:: https://img.shields.io/github/issues/HazyResearch/pdftotree.svg
   :target: https://github.com/HazyResearch/pdftotree/issues
.. |CI-CD| image:: https://img.shields.io/github/workflow/status/HazyResearch/pdftotree/test.svg
   :target: https://github.com/HazyResearch/pdftotree/actions
.. |Codecov| image:: https://img.shields.io/codecov/c/github/HazyResearch/pdftotree
   :target: https://codecov.io/gh/HazyResearch/pdftotree
.. |CodeStyle| image:: https://img.shields.io/badge/code%20style-black-000000.svg
   :target: https://github.com/ambv/black
.. _Fonduer: https://github.com/HazyResearch/fonduer
.. _DocumentAnnotation: https://github.com/payalbajaj/DocumentAnnotation
.. _PaleoDocs: http://i.stanford.edu/hazy/share/fonduer/pdftotree_paleo.tar.gz
.. _version file: https://github.com/HazyResearch/pdftotree/blob/master/pdftotree/_version.py
.. _editable mode: https://packaging.python.org/tutorials/distributing-packages/#working-in-development-mode
.. _flake8: http://flake8.pycqa.org/en/latest/
.. _hOCR: http://kba.cloud/hocr-spec/1.2/