tabula-extractor

0.8.0
Rubygems

Version published: 9 years ago

Maintainers: 3

Created: 9 years ago

tabula-extractor

Extract tables from PDF files. tabula-extractor is the table extraction engine that powers Tabula, now available as a library and command line program.

Versions 0.9.6 and greater of Tabula can export shell scripts using tabula-extractor for bulk extraction.

Installation

tabula-extractor only works with JRuby 1.7 or newer. Install JRuby and run:

jruby -S gem install tabula-extractor

Usage

Tabula helps you extract tables from PDFs

Usage:
       tabula [options] <pdf_file>
where [options] are:
Tabula helps you extract tables from PDFs
       --pages, -p <s>:   Comma separated list of ranges. Examples: --pages
                          1-3,5-7 or --pages 3. Default is --pages 1 (default:
                          1)
        --area, -a <s>:   Portion of the page to analyze
                          (top,left,bottom,right). Example: --area
                          269.875,12.75,790.5,561. Default is entire page
     --columns, -c <s>:   X coordinates of column boundaries. Example --columns
                          10.1,20.2,30.3
    --password, -s <s>:   Password to decrypt document. Default is empty
                          (default: )
           --guess, -g:   Guess the portion of the page to analyze per page.
           --debug, -d:   Print detected table areas instead of processing.
      --format, -f <s>:   Output format (CSV,TSV,HTML,JSON) (default: CSV)
     --outfile, -o <s>:   Write output to <file> instead of STDOUT (default: -)
     --spreadsheet, -r:   Force PDF to be extracted using spreadsheet-style
                          extraction (if there are ruling lines separating each
                          cell, as in a PDF of an Excel spreadsheet)
  --no-spreadsheet, -n:   Force PDF not to be extracted using spreadsheet-style
                          extraction (if there are ruling lines separating each
                          cell, as in a PDF of an Excel spreadsheet)
          --silent, -i:   Suppress all stderr output.
--use-line-returns, -u:   Use embedded line returns in cells.
         --version, -v:   Print version and exit
            --help, -h:   Show this message
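
The `--pages` format above (comma-separated pages and ranges) can be illustrated with a few lines of plain Ruby. This is only a sketch of the format, not the parser Tabula itself uses; `expand_page_spec` is a hypothetical helper name.

```ruby
# Hypothetical helper (not part of tabula-extractor): expand a --pages
# style string such as "1-3,5-7" into a sorted list of page numbers.
def expand_page_spec(spec)
  spec.split(',').flat_map do |part|
    first, last = part.strip.split('-', 2).map { |s| Integer(s, 10) }
    last ||= first # a single page like "3" is the range 3..3
    raise ArgumentError, "bad range: #{part}" if first < 1 || last < first
    (first..last).to_a
  end.uniq.sort
end

p expand_page_spec('1-3,5-7') # prints [1, 2, 3, 5, 6, 7]
p expand_page_spec('3')       # prints [3]
```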

Command Line Examples

These examples use documents contained in tabula-extractor's test folder. If you want to follow along, download the documents and give it a shot. There's more extensive explanation here.

Extract all the tables from a document into a spreadsheet called output.csv:

tabula test/heuristic-test-set/spreadsheet/tabla_subsidios.pdf -o output.csv

Extract only the tables on page 1 into a spreadsheet called output.csv:

tabula --pages 1 test/heuristic-test-set/spreadsheet/strongschools.pdf -o output.csv

Extract only the tables on page 1 into a CSV spreadsheet onto STDOUT (that is, print it out in your terminal window):

tabula --pages 1 test/heuristic-test-set/spreadsheet/strongschools.pdf

Extract the data from the table contained within a certain area on page 1 into a spreadsheet called output.csv:

tabula test/data/vertical_rulings_bug.pdf --area 250,0,325,1700 --pages 1 -o output.csv

Extract all the tables from a document into a tab-separated spreadsheet called output.tsv:

tabula test/heuristic-test-set/spreadsheet/strongschools.pdf -o output.tsv --format TSV #should exclude guff

Extract the table from page 1, using specified locations for column boundaries, into a spreadsheet called output.csv:

tabula test/data/campaign_donors.pdf -o output.csv --columns 47,147,256,310,375,431,504
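
If you need to run the same extraction over many files, one option is to generate the command lines from a script. Here's a rough sketch that uses only flags from the usage text above; the helper name `tabula_command`, the example file names and the output naming scheme are all made up for illustration.

```ruby
require 'shellwords'

# Hypothetical helper: build one `tabula` command line for a PDF.
# Assumption: the output file is named after the PDF, with the extension
# swapped for the lowercased output format.
def tabula_command(pdf_path, pages: '1', format: 'CSV', area: nil)
  outfile = File.basename(pdf_path, '.pdf') + '.' + format.downcase
  args = ['tabula', '--pages', pages, '--format', format, '-o', outfile]
  args += ['--area', area.join(',')] if area # top,left,bottom,right
  args << pdf_path
  Shellwords.join(args) # escapes spaces and other shell metacharacters
end

pdfs = ['reports/january 2015.pdf', 'reports/february.pdf']
pdfs.each { |pdf| puts tabula_command(pdf, pages: '1-3') }
```

Each printed line can be pasted into a shell script. `Shellwords.join` takes care of quoting, so file names with spaces survive.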

Scripting examples

tabula-extractor is also a RubyGem that you can use to programmatically extract tabular data, using the Tabula engine, in your scripts or applications. We don't have docs yet, but the tests are a good source of information.

Here's a very basic example, using the "spreadsheet" extraction method:

require 'tabula'

pdf_file_path = "whatever.pdf"
outfilename = "whatever.csv"

out = open(outfilename, 'w')

# :all asks the extractor for every page of the PDF
extractor = Tabula::Extraction::ObjectExtractor.new(pdf_file_path, :all)
extractor.extract.each do |pdf_page|
  # write each detected spreadsheet-style table as its own CSV block
  pdf_page.spreadsheets.each do |spreadsheet|
    out << spreadsheet.to_csv
    out << "\n\n"
  end
end
extractor.close!
out.close

Here's another example using the "original" extraction method, which is useful for tables that don't have ruling lines separating the rows and cells. This example extracts data from only pages 1 and 2.

require 'tabula'

pdf_file_path = "whatever.pdf"
outfilename = "whatever.csv"

out = open(outfilename, 'w')

extractor = Tabula::Extraction::ObjectExtractor.new(pdf_file_path, 1..2)
extractor.extract.each_with_index do |pdf_page, page_index|
  page_areas = [[250, 0, 325, 1700]]

  page_areas.each do |page_area|
    out << pdf_page.get_area(page_area).get_table.to_csv
    out << "\n\n"
  end

end
extractor.close!
out.close

This similar example also uses the "original" extraction method, but specifies the location of columns. This is a useful tactic when crappy PDF creation software lets one column's text flow into the next column. Unless you specify column locations manually, Tabula will combine the two columns. You can find the column locations using a screen ruler; I find it works well to measure the width of the entire PDF and scale the locations based on the width of the page as PDFBox renders it, as shown in the example below.

require 'tabula'

pdf_file_path = "whatever.pdf"
outfilename = "whatever.csv"

out = open(outfilename, 'w')

extractor = Tabula::Extraction::ObjectExtractor.new(pdf_file_path, 1..2)
extractor.extract.each_with_index do |pdf_page, page_index|
  page_areas = [[250, 0, 325, 1700]]

  # 1700 is the width of the page as you measured it with the screen ruler;
  # dividing by a float avoids accidental integer division.
  scale_factor = pdf_page.width / 1700.0

  vertical_ruling_locations = [0, 360, 506, 617, 906, 1034, 1160, 1290, 1418, 1548] #column locations
  vertical_rulings = vertical_ruling_locations.map{|n| Tabula::Ruling.new(0, n * scale_factor, 0, 1000)}

  page_areas.each do |page_area|
    out << pdf_page.get_area(page_area).get_table(:vertical_rulings => vertical_rulings).to_csv
    out << "\n\n"
  end
end
extractor.close!
out.close
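
The scaling step on its own, without Tabula, looks roughly like this. It's a minimal sketch: `scale_column_positions` is a hypothetical helper, and the 612-point page width is an assumed value (the width of a US Letter page), not something read from a real PDF.

```ruby
# Convert column x-positions measured on screen (in whatever units the
# ruler reports) into page coordinates, by scaling with the ratio of the
# page width to the measured width.
def scale_column_positions(measured_positions, measured_width, page_width)
  scale_factor = page_width.to_f / measured_width
  measured_positions.map { |x| (x * scale_factor).round(2) }
end

# Assumed numbers: the ruler says the page is 1700 units wide, and the
# page is taken to be 612 points wide.
scaled = scale_column_positions([0, 360, 506, 1548], 1700, 612.0)
puts scaled.join(',')
```

The comma-joined result has the same shape as the value the `--columns` flag expects, assuming the flag takes positions in page coordinates.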

`tabula-extractor` has also been used successfully as a part of data extraction pipelines. [This blog post](http://open.blogs.nytimes.com/2015/04/03/purifying-the-sea-of-pdf-data-automatically/) discusses a possible pattern for creating these and includes a few examples:

- Sierra Leone’s Ebola situation reports: [GitHub](https://github.com/jeremybmerrill/ebola_parsers/tree/master/sierra_leone)
- The NYPD’s CompStat criminal complaints database weekly reports: [GitHub](https://github.com/nytinteractive/compstat_parser)
- The NYPD’s monthly reports of moving summonses: [GitHub](https://github.com/nytinteractive/moving_summonses_parser)


## How Does This Work? Like, Theoretically?

PDFs are a terrible format for transmitting tabular data. Tabula uses two algorithms to try to reconstruct the underlying structure of the data table. This section describes how PDFs represent your data and how Tabula extracts it so you can use `tabula-extractor` productively.

PDFs were designed to represent a paper document's layout across various computers and on paper, so they focus on precise positioning. They include primitives for text strings, geometric shapes, images and videos (and more), but no data tables. Tabula includes a Java library called PDFBox to access those embedded text strings and geometric shapes and uses them to reconstruct your table. 

<em style="margin-left: 5px;"> Why Can't Tabula Process Scanned Pages? Scanned PDF pages usually contain only one primitive: the image of the scanned page. Since those PDFs don't contain text strings or geometric shapes, Tabula won't be able to reconstruct your data -- unless you run the PDF through an OCR (Optical Character Recognition) program, which re-inserts those text strings into their original position, though the results can be error prone.</em>

Tabula has two distinct algorithms to use for different kinds of tables. It uses a heuristic to try to guess which algorithm to use for each table, but this heuristic is wrong fairly often, so you may need to specify which algorithm to use, using the Extraction Method selector buttons in the GUI or the `--spreadsheet` or `--no-spreadsheet` flags on the command line.

- The `spreadsheet` algorithm uses geometric lines to reconstruct the table structure. After discarding oblique lines, the algorithm finds all of the lines' crossing points. Using those crossing points, it creates a large list of minimal rectangular areas (that is, rectangles that contain no other rectangles) that are spreadsheet cells. The minimum bounding box of groups of adjacent cells is a table (called a Spreadsheet object). After spreadsheet objects are created, empty "placeholder" cells are created when a cell in one row (or, likewise, column) spans over a space in which multiple cells are contained in another row. Once we have the dimensions of all the cells on the page, the PDFBox library can get the text contained within each cell. 
- The `original` or `no-spreadsheet` algorithm uses only the position of text elements on the page. (Because OCR software doesn't reconstruct lines, this is the only algorithm available for OCRed PDFs.) The algorithm collects all the text on the page (or within the area of the page that contains a table, specified with the Tabula GUI or the `--area` flag) and finds "rivers" -- vertical spaces that don't contain any text for the entire height of the table. These are considered column boundaries. (If text from one column flows into another column because the PDF was created with crappy software, you can specify the column boundaries manually with the `--columns` flag.) Each line of text on the page (by unique y location) is considered a separate row in the table. (If cells contain multiple lines of text, you may have to write a script to "roll them up" -- Tabula can't provide this functionality.)
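
To make the "rivers" idea concrete, here's a toy version in plain Ruby. It is not Tabula's implementation: `TextChunk` and `find_rivers` are hypothetical names, and each piece of text is reduced to just its left and right x positions.

```ruby
# A text chunk reduced to its horizontal extent (hypothetical struct; the
# real library has its own text element classes).
TextChunk = Struct.new(:text, :left, :right)

# Return the x-ranges that no chunk overlaps, between the leftmost and
# rightmost text. Each gap is a candidate column boundary (a "river").
def find_rivers(chunks)
  spans = chunks.map { |c| [c.left, c.right] }.sort
  merged = []
  spans.each do |left, right|
    if merged.any? && left <= merged.last[1]
      # overlaps the previous span: extend it
      merged.last[1] = [merged.last[1], right].max
    else
      merged << [left, right]
    end
  end
  # the space between consecutive merged spans is text-free
  merged.each_cons(2).map { |a, b| [a[1], b[0]] }
end

chunks = [
  TextChunk.new('Name',   10, 40), TextChunk.new('Amount', 120, 160),
  TextChunk.new('Alice',  10, 45), TextChunk.new('12.50',  125, 155),
  TextChunk.new('Robert', 10, 52), TextChunk.new('7.00',   130, 155)
]
p find_rivers(chunks) # prints [[52, 120]]
```

If one chunk stretched from x=10 to x=130, the spans would merge into one and no river would be found. That's the overflow situation where you'd specify column boundaries by hand.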

These two algorithms are inspired by some academic work, including Anssi Nurminen's "[Algorithmic Extraction of Data in Tables in Pdf Documents](http://dspace.cc.tut.fi/dpub/bitstream/handle/123456789/21520/Nurminen.pdf?sequence=3)" (2013) for the spreadsheet algorithm.
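
The crossing-point step of the spreadsheet algorithm can be sketched in a similar way. This is a simplified illustration, not Tabula's code or Nurminen's exact method: `HLine`, `VLine`, `crossing_points` and `minimal_cells` are hypothetical names, coordinates are integers with y growing downward, and the sketch only checks that the corners of a cell exist, not that rulings connect them.

```ruby
require 'set'

# Hypothetical minimal types: a horizontal ruling at height y spanning
# x1..x2, and a vertical ruling at position x spanning y1..y2.
HLine = Struct.new(:y, :x1, :x2)
VLine = Struct.new(:x, :y1, :y2)

# Every [x, y] where a horizontal and a vertical ruling cross.
def crossing_points(hlines, vlines)
  hlines.product(vlines).select do |h, v|
    v.x.between?(h.x1, h.x2) && h.y.between?(v.y1, v.y2)
  end.map { |h, v| [v.x, h.y] }.to_set
end

# Treat each crossing point as a top-left corner, then look for the
# closest point below and the closest point to the right whose opposite
# corner is also a crossing point. That rectangle is a minimal cell.
def minimal_cells(points)
  pts = points.to_a
  cells = []
  pts.sort.each do |x, y|
    below = pts.select { |px, py| px == x && py > y }.map(&:last).sort
    right = pts.select { |px, py| py == y && px > x }.map(&:first).sort
    below.each do |by|
      rx = right.find { |candidate| points.include?([candidate, by]) }
      next unless rx
      cells << { top: y, left: x, bottom: by, right: rx }
      break
    end
  end
  cells
end

# A 2x2 grid: three horizontal and three vertical rulings.
hlines = [0, 10, 20].map { |y| HLine.new(y, 0, 30) }
vlines = [0, 15, 30].map { |x| VLine.new(x, 0, 20) }
cells = minimal_cells(crossing_points(hlines, vlines))
puts cells.length # prints 4
```

If the middle vertical ruling only spans the bottom row, the top row comes out as one wide cell. That's the kind of spanning cell the placeholder step described above has to account for.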

## Documentation

You're welcome to try to integrate the `tabula-extractor` gem into your project. We don't really have documentation yet, though the tests may be a good source. If you're going to, please feel free to drop us a note and we may be able to give you some pointers.

Package last updated on 20 Aug 2015
