pdf-reader-turtletext

Rubygems

Version: 0.2.2

Version published: 13 years ago

Maintainers: 1

Created: 13 years ago

= PDF::Reader::Turtletext {build status}[http://travis-ci.org/tardate/pdf-reader-turtletext]

PDF::Reader::Turtletext is an extension for the most excellent {PDF::Reader}[https://github.com/yob/pdf-reader] gem.

The aim of Turtletext is to provide simple and convenient methods for extracting PDF text content and converting it to structured data - even when there is no explicit structure in the original PDF source.

A typical use is to extract details from utility bills that are provided in PDF format, to open up the data for analysis and other secondary uses.

For an example of how this works in practice, see the {sps_bill}[https://github.com/tardate/sps_bill_scanner/] gem (which is in fact the project where the original ideas for Turtletext gestated).

== Requirements and Known Limitations

* Tested with MRI 1.8.7, 1.9.2, 1.9.3, Rubinius (1.8 and 1.9 mode), JRuby (1.8 and 1.9 mode).
* Has a fixed dependency on PDF::Reader v1.1.1.

== The PDF::Reader::Turtletext Cookbook

=== How do I install it for normal use?

It is distributed as a gem, so all normal gem installation procedures apply. To install the gem directly from the command line:

  $ gem install pdf-reader-turtletext

If you are using bundler or Rails, add to your Gemfile:

  gem 'pdf-reader-turtletext'

Then bundle install:

  $ bundle

=== How do I install it for gem development?

If you want to work on enhancements or fix bugs in PDF::Reader::Turtletext, fork and clone the GitHub repository. If you are using bundler (recommended), run bundle to install development dependencies.

See the section below on 'Contributing to PDF::Reader::Turtletext' for more information.

=== How to instantiate Turtletext in code

All interaction is done using an instance of the PDF::Reader::Turtletext class. It is initialised given a filename or IO-like object, and any required options.

Typical usage:

  pdf_filename = '../some_path/some.pdf'
  reader = PDF::Reader::Turtletext.new(pdf_filename)

  options = { :y_precision => 5 }
  reader_with_options = PDF::Reader::Turtletext.new(pdf_filename, options)

=== How to extract text within a region described in relation to other text

Problem: we don't know exactly where the required text will be on the page, and it is not encoded within the PDF as a single object. But we do know that it will be relatively positioned (for example) below a certain bit of text, to the left of another, and above some other text.

Solution: use the bounding_box method to describe the region and extract the matching text.

  textangle = reader.bounding_box do
    page 1
    below /electricity/i
    above 10
    right_of 240.0
    left_of "Total ($)"
  end
  textangle.text
  => [['string','string'],['string']] # array of rows, each row is an array of text elements in the row

The range of methods that can be used within the bounding_box block are all optional, and include:

* page - specifies the PDF page from which to extract text (default is 1).
* inclusive - whether region selection should be inclusive or exclusive of the specified positions (default is false).
* below - a string, regex or number that describes the upper limit of the text box (default is top border of the page).
* above - a string, regex or number that describes the lower limit of the text box (default is bottom border of the page).
* left_of - a string, regex or number that describes the right limit of the text box (default is right border of the page).
* right_of - a string, regex or number that describes the left limit of the text box (default is left border of the page).

Note that left_of and right_of constraints do not need to be within the vertical range of the box being described. For example, you could use an element in the page header to describe the left_of limit for a table at the bottom of the page, if it has the correct alignment needed to describe your text region.

Similarly, above and below constraints do not need to be within the horizontal range of the box being described.
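
The way these constraints combine can be sketched in plain Ruby without the gem. This is an illustrative model only, not the gem's implementation: the +ELEMENTS+ data, +resolve+ and +sketch_bounding_box+ are hypothetical names made up for this example. It shows how a string, regex or number might each resolve to a coordinate, and that the "Total ($)" element used for +left_of+ sits outside the vertical range of the selected box.

```ruby
# Hypothetical page content: each element has text and a bottom-left x,y position.
ELEMENTS = [
  { :text => "Account summary", :x => 10.0,  :y => 700.0 },
  { :text => "Total ($)",       :x => 400.0, :y => 650.0 },
  { :text => "Electricity",     :x => 10.0,  :y => 600.0 },
  { :text => "Usage",           :x => 250.0, :y => 550.0 },
  { :text => "123 kWh",         :x => 250.0, :y => 500.0 },
  { :text => "45.60",           :x => 400.0, :y => 500.0 },
  { :text => "Page 1",          :x => 10.0,  :y => 5.0 }
]

# Resolve a constraint to a coordinate on the given axis (:x or :y).
# Numbers pass through; strings and regexes are looked up in the elements
# and the first matching element supplies the coordinate.
def resolve(constraint, axis, elements)
  return nil if constraint.nil?
  return constraint.to_f if constraint.is_a?(Numeric)
  match = elements.find do |e|
    constraint.is_a?(Regexp) ? e[:text] =~ constraint : e[:text] == constraint
  end
  match && match[axis]
end

# Select the text of every element strictly inside the described box.
# A missing limit means "up to the page border" (no restriction).
def sketch_bounding_box(elements, limits)
  top    = resolve(limits[:below],    :y, elements)
  bottom = resolve(limits[:above],    :y, elements)
  left   = resolve(limits[:right_of], :x, elements)
  right  = resolve(limits[:left_of],  :x, elements)
  inside = elements.select do |e|
    (top.nil?    || e[:y] < top)    &&
    (bottom.nil? || e[:y] > bottom) &&
    (left.nil?   || e[:x] > left)   &&
    (right.nil?  || e[:x] < right)
  end
  inside.map { |e| e[:text] }
end

BOX_TEXT = sketch_bounding_box(ELEMENTS,
  :below => /electricity/i, :above => 10,
  :right_of => 240.0, :left_of => "Total ($)")
# BOX_TEXT is ["Usage", "123 kWh"]
```

The "45.60" element is left out because it shares its x position with "Total ($)" and the selection in this sketch is exclusive.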

=== Using a block parameter with the bounding_box method

An explicit block parameter may be used with the bounding_box method:

  textangle = reader.bounding_box do |r|
    r.below /electricity/i
    r.left_of "Total ($)"
  end
  textangle.text
  => [['string','string'],['string']] # array of rows, each row is an array of text elements in the row

=== How to describe an inclusive bounding_box region

By default, the bounding_box method makes an exclusive selection (i.e. not including the region limits).

To specify an inclusive region, use the inclusive! command:

  textangle = reader.bounding_box do
    inclusive!
    below /electricity/i
    left_of "Total ($)"
  end

Alternatively, set inclusive to true:

  textangle = reader.bounding_box do
    inclusive true
    below /electricity/i
    left_of "Total ($)"
  end

Or with a block parameter, you may also assign inclusive to true:

  textangle = reader.bounding_box do |r|
    r.inclusive = true
    r.below /electricity/i
    r.left_of "Total ($)"
  end
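
The difference between exclusive and inclusive selection can be illustrated with a small stand-alone sketch. It does not use the gem: +ROWS+, +within?+ and +select_rows+ are hypothetical names, and the comparison logic is an assumption about what "inclusive of the specified positions" means, namely that text sitting exactly on a limit is selected only in inclusive mode.

```ruby
# Hypothetical rows keyed by their y position (origin at the bottom-left).
ROWS = { 600.0 => "Electricity", 550.0 => "Usage", 500.0 => "Total ($)" }

# True when value lies between min and max; the limits themselves
# count only when inclusive is true.
def within?(value, min, max, inclusive = false)
  if inclusive
    value >= min && value <= max
  else
    value > min && value < max
  end
end

# Pick the row texts whose y position falls in the requested range.
def select_rows(rows, y_min, y_max, inclusive = false)
  rows.select { |y, _text| within?(y, y_min, y_max, inclusive) }.values
end

EXCLUSIVE_ROWS = select_rows(ROWS, 500.0, 600.0)        # ["Usage"]
INCLUSIVE_ROWS = select_rows(ROWS, 500.0, 600.0, true)  # ["Electricity", "Usage", "Total ($)"]
```

In this model the rows that define the limits ("Electricity" and "Total ($)") appear in the result only when the selection is inclusive.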

=== Extract text for a region with known positional co-ordinates

If you know (or can calculate) the x,y positions of the required text region, you can extract the region's text using the text_in_region method.

  text = reader.text_in_region(
    10,   # minimum x (left-most)
    900,  # maximum x (right-most)
    200,  # minimum y (bottom-most)
    400,  # maximum y (top-most)
    1,    # page (default 1)
    false # inclusive of x/y position if true (default false)
  )
  => [['string','string'],['string']] # array of rows, each row is an array of text elements in the row

Note that the x,y origin is at the bottom-left of the page.
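
Because the origin is at the bottom-left, a larger y value is higher on the page. The following stand-alone sketch models a region query over hypothetical data to make that concrete. It is not the gem's code: +CONTENT+ and +sketch_text_in_region+ are invented for illustration, and returning rows from the top of the page downwards is an assumption of this sketch.

```ruby
# Hypothetical page content keyed by y position, then by x position.
CONTENT = {
  700.0 => { 10.0 => "Header" },
  350.0 => { 10.0 => "Gas",   400.0 => "12.30" },
  250.0 => { 10.0 => "Water", 400.0 => "4.50" },
  100.0 => { 10.0 => "Footer" }
}

# Return an array of rows (each an array of text elements) for the elements
# inside the rectangle. Rows are ordered top to bottom (descending y) and
# elements within a row left to right (ascending x).
def sketch_text_in_region(content, xmin, xmax, ymin, ymax, inclusive = false)
  inside = lambda do |value, low, high|
    inclusive ? (value >= low && value <= high) : (value > low && value < high)
  end
  rows = []
  content.keys.sort.reverse.each do |y|
    next unless inside.call(y, ymin, ymax)
    xs = content[y].keys.sort.select { |x| inside.call(x, xmin, xmax) }
    rows << xs.map { |x| content[y][x] } unless xs.empty?
  end
  rows
end

REGION_ROWS = sketch_text_in_region(CONTENT, 5, 900, 200, 400)
# REGION_ROWS is [["Gas", "12.30"], ["Water", "4.50"]]
```

Moving the left limit to exactly x = 10 drops the "Gas" and "Water" labels in exclusive mode, and keeps them when the last argument is true.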

=== How to find the x,y co-ordinate of a specific text element

Problem: if you are doing low-level text extraction with text_in_region for example, it is usually necessary to locate specific text to provide a positional reference.

Solution: use the text_position method to locate text by exact or partial match. It returns a Hash of x/y co-ordinates that is the bottom-left corner of the text.

  page = 1
  text_by_exact_match = reader.text_position("Transaction Table", page)
  => { :x => 10.0, :y => 600.0 }
  text_by_regex_match = reader.text_position(/transaction summary/i, page)
  => { :x => 10.0, :y => 300.0 }

Note: in the case of multiple matches, only the first match is returned.
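
The "first match wins" behaviour can be modelled with a short stand-alone sketch. It does not call the gem: +PAGE_TEXT+ and +sketch_text_position+ are hypothetical, the scan order is simply the order of the sample array, and treating a string argument as a literal substring pattern is an assumption made to cover both exact and partial matches.

```ruby
# Hypothetical page content, listed in the order it is scanned.
PAGE_TEXT = [
  { :text => "Transaction Table",               :x => 10.0, :y => 600.0 },
  { :text => "Transaction Summary",             :x => 10.0, :y => 300.0 },
  { :text => "Transaction Summary (continued)", :x => 10.0, :y => 100.0 }
]

# Return the bottom-left position of the first element whose text matches,
# or nil when nothing matches. Strings are escaped so that characters such
# as "(" and "$" are matched literally.
def sketch_text_position(elements, text)
  pattern = text.is_a?(Regexp) ? text : Regexp.new(Regexp.escape(text))
  hit = elements.find { |e| e[:text] =~ pattern }
  hit && { :x => hit[:x], :y => hit[:y] }
end

TABLE_POSITION   = sketch_text_position(PAGE_TEXT, "Transaction Table")
SUMMARY_POSITION = sketch_text_position(PAGE_TEXT, /transaction summary/i)
# TABLE_POSITION is { :x => 10.0, :y => 600.0 }
# SUMMARY_POSITION is { :x => 10.0, :y => 300.0 }, the first of two matches
```

When several elements could match, a more specific string or an anchored regex narrows the lookup to the intended one.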

== Contributing to PDF::Reader::Turtletext

* Check out the latest master to make sure the feature hasn't been implemented or the bug hasn't been fixed yet.
* Check out the issue tracker to make sure someone hasn't already requested it and/or contributed it.
* Fork the project.
* Start a feature/bugfix branch.
* Commit and push until you are happy with your contribution.
* Make sure to add tests for it. This is important so I don't break it in a future version unintentionally.
* Please try not to mess with the Rakefile, version, or history. If you want to have your own version, or it is otherwise necessary, that is fine, but please isolate the change in its own commit so I can cherry-pick around it.

== Copyright

Package last updated on 01 Aug 2012
