pdftojson

pdftotext wrapper that generates JSON with bounding box data. Takes care of duplicate characters.

Version
0.0.3

pdftojson is a pdftotext wrapper that generates JSON with bounding box data. It takes care of overlapping, duplicated characters, which often exist in PDF files generated by MS Word with floating images and text.

Why bother with a wrapper for pdftotext?

Consider this PDF file:

[Image: PDF sample]

pdftotext -bbox theFile.pdf would generate this:

...
<word xMin="103.320000" yMin="547.355700" xMax="152.368008" yMax="561.321720">(6)綠線</word>
<word xMin="155.880000" yMin="547.355700" xMax="176.846541" yMax="561.321720">G01</word>
<word xMin="155.880000" yMin="547.355700" xMax="162.867200" yMax="561.321720">G</word>
<word xMin="180.300000" yMin="547.355700" xMax="222.295867" yMax="561.321720">站延伸</word>
<word xMin="208.080000" yMin="547.355700" xMax="264.053062" yMax="561.321720">伸至大溪</word>
<word xMin="264.480000" yMin="547.355700" xMax="334.420485" yMax="561.321720">、龍潭先進</word>
<word xMin="320.340000" yMin="547.355700" xMax="348.294390" yMax="561.321720">進公</word>
<word xMin="124.680000" yMin="572.375700" xMax="166.675867" yMax="586.341720">共運輸</word>
<word xMin="152.700000" yMin="572.375700" xMax="222.644667" yMax="586.341720">輸系統發展</word>
<word xMin="208.440000" yMin="572.375700" xMax="278.395867" yMax="586.341720">展委託可行</word>
<word xMin="264.840000" yMin="572.375700" xMax="320.813062" yMax="586.341720">行性研究</word>
...

pdftotext does a great job of "undoing" the physical layout (columns, hyphenation, etc.) of a PDF document. However, its output contains some overlapping and duplicate words: in the excerpt above, "G" repeats the start of "G01" at the same xMin, and "站延伸" overlaps "伸至大溪". PDF layout engines sometimes generate these quirks when images and text are mixed within a page.
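To inspect such overlaps programmatically, the `<word>` elements in the excerpt above can be read into plain objects. The following is a minimal sketch, not pdftojson's own parser: `parseBboxWords` is a hypothetical helper name, and the regular expression assumes the attribute order shown in the excerpt (xMin, yMin, xMax, yMax).

```javascript
// Sketch: parse <word> elements from `pdftotext -bbox` output into objects.
// parseBboxWords is a hypothetical helper, not part of pdftojson.
function parseBboxWords(xhtml) {
  // Assumes the attribute order seen in the excerpt: xMin, yMin, xMax, yMax.
  const pattern =
    /<word xMin="([\d.]+)" yMin="([\d.]+)" xMax="([\d.]+)" yMax="([\d.]+)">([^<]*)<\/word>/g;
  const words = [];
  for (const m of xhtml.matchAll(pattern)) {
    words.push({
      xMin: Number(m[1]),
      yMin: Number(m[2]),
      xMax: Number(m[3]),
      yMax: Number(m[4]),
      text: m[5], // HTML entities such as &amp; are left undecoded in this sketch
    });
  }
  return words;
}

const sample = `
<word xMin="155.880000" yMin="547.355700" xMax="176.846541" yMax="561.321720">G01</word>
<word xMin="155.880000" yMin="547.355700" xMax="162.867200" yMax="561.321720">G</word>
`;
console.log(parseBboxWords(sample));
```

Both parsed words start at the same xMin, which is the duplicate that a wrapper has to resolve.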

In contrast, pdftojson theFile.pdf would generate this:

...
{
    "xMin": 103.2,
    "xMax": 348.29439,
    "yMin": 547.3557,
    "yMax": 561.32172,
    "text": "(6)綠線 G01 站延伸至大溪、龍潭先進公"
},
{
    "xMin": 124.68,
    "xMax": 320.813062,
    "yMin": 572.3757,
    "yMax": 586.34172,
    "text": "共運輸系統發展委託可行性研究"
}
...
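One way such a merge could work is sketched below. This is an illustration under stated assumptions, not pdftojson's actual algorithm: `mergeLine` is a hypothetical function, it expects words that already share one text line, and the 1 pt `gapForSpace` threshold is an assumption chosen only because it fits the sample above. Words fully covered by the text merged so far are dropped, and for overlapping words the repeated leading characters are removed.

```javascript
// Sketch: merge the words of one text line into a single bounding box.
// mergeLine and gapForSpace are hypothetical, not part of pdftojson.
function mergeLine(words, gapForSpace = 1) {
  // Left to right; on equal xMin the wider word comes first.
  const sorted = [...words].sort((a, b) => a.xMin - b.xMin || b.xMax - a.xMax);
  const merged = { ...sorted[0] };
  for (const word of sorted.slice(1)) {
    if (word.xMax <= merged.xMax) continue; // fully covered: treat as duplicate
    let text = word.text;
    if (word.xMin < merged.xMax) {
      // Overlap: drop the longest prefix that repeats the end of the merged text.
      let k = Math.min(merged.text.length, text.length);
      while (k > 0 && !merged.text.endsWith(text.slice(0, k))) k--;
      text = text.slice(k);
    } else if (word.xMin - merged.xMax > gapForSpace) {
      text = ' ' + text; // a visible gap becomes a space
    }
    merged.text += text;
    merged.xMax = word.xMax;
    merged.yMin = Math.min(merged.yMin, word.yMin);
    merged.yMax = Math.max(merged.yMax, word.yMax);
  }
  return merged;
}

// The second line of the pdftotext excerpt above.
const line = [
  { xMin: 124.68, yMin: 572.3757, xMax: 166.675867, yMax: 586.34172, text: '共運輸' },
  { xMin: 152.7, yMin: 572.3757, xMax: 222.644667, yMax: 586.34172, text: '輸系統發展' },
  { xMin: 208.44, yMin: 572.3757, xMax: 278.395867, yMax: 586.34172, text: '展委託可行' },
  { xMin: 264.84, yMin: 572.3757, xMax: 320.813062, yMax: 586.34172, text: '行性研究' },
];
console.log(mergeLine(line).text); // 共運輸系統發展委託可行性研究
```

The suffix/prefix comparison is a simple heuristic; it would also strip a legitimately repeated character at an overlap, so real input may need a stricter geometric check.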

Install

$ npm install pdftojson

pdftojson uses pdftotext. Please make sure pdftotext is available in PATH.
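Before calling the library, a script can check whether pdftotext is reachable. The sketch below is one possible check, not something pdftojson provides: `isOnPath` is a hypothetical helper that tries to start the command with `-v` and reports whether the process could be spawned at all.

```javascript
const { spawnSync } = require('node:child_process');

// Sketch: true if `command` can be started, false if spawning fails
// (for example with ENOENT when it is not found in PATH).
// isOnPath is a hypothetical helper, not part of pdftojson.
function isOnPath(command) {
  const result = spawnSync(command, ['-v'], { stdio: 'ignore' });
  return !result.error;
}

if (!isOnPath('pdftotext')) {
  console.error('pdftotext was not found in PATH; install it before using pdftojson.');
}
```

Only the spawn error is inspected, not the exit code, so the check does not depend on how a particular pdftotext build reports its version.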

Usage

pdftojson is available as a command-line tool and a Node.js library.

CLI

# outputs some.json
$ pdftojson some.pdf

# converts pages 3 to 6 of some.pdf and outputs to some.json
# (-f and -l are pdftotext's first-page and last-page options)
$ pdftojson -c "-f 3 -l 6" some.pdf

Node.js Library

The library exposes a single function that takes the name of a PDF file and returns a promise.

import pdftojson from 'pdftojson';

pdftojson("./some.pdf").then((output) => {
  // output is a JavaScript array of pages (see "Output format").
});
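Because the function returns a promise, it can also be awaited. The sketch below collects the text of every page, assuming the output shape described under "Output format". `textByPage` is a hypothetical helper, and the converter is passed in as the `convert` parameter so the sketch can be tried with a stand-in function when the package or pdftotext is not installed.

```javascript
// Sketch: collect the text of each page, one entry per page.
// `convert` stands in for the pdftojson function; textByPage is hypothetical.
async function textByPage(convert, pdfPath) {
  const pages = await convert(pdfPath);
  return pages.map((page) => page.words.map((word) => word.text).join('\n'));
}

// Stand-in converter with made-up data in the documented output shape.
const fakeConvert = async () => [
  { width: 595, height: 842, words: [{ text: 'first' }, { text: 'second' }] },
];

textByPage(fakeConvert, './some.pdf').then((texts) => {
  console.log(texts[0]);
});
```

With the real package, the imported `pdftojson` function would be passed as `convert`.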

Output format

All numeric values are in points (pt; 1 pt = 1/72 inch).

[
  { //: Page
    width: (Number) page width,
    height: (Number) page height,
    words: [
      {
        text: (String) the text enclosed in the bounding box,

        // All coordinates calculated from top-left corner of the page
        xMin: (Number) left edge of the bounding box,
        xMax: (Number) right edge of the bounding box,
        yMin: (Number) top edge of the bounding box,
        yMax: (Number) bottom edge of the bounding box
      }, // ...
    ]
  }, // ...
]
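As an illustration of working with this structure, the sketch below selects the words that lie entirely inside a rectangle and converts a length from pt to millimetres. `wordsInRegion` and `ptToMm` are hypothetical helpers, not part of pdftojson, and the page size in the sample is made up; the two words are taken from the example earlier in this README.

```javascript
// 1 pt = 1/72 inch and 1 inch = 25.4 mm.
const MM_PER_PT = 25.4 / 72;

// Hypothetical helper: convert a length in pt to millimetres.
function ptToMm(pt) {
  return pt * MM_PER_PT;
}

// Hypothetical helper: words whose bounding box lies entirely inside `region`.
// Coordinates are measured from the top-left corner, as in the output format.
function wordsInRegion(page, region) {
  return page.words.filter(
    (w) =>
      w.xMin >= region.xMin &&
      w.xMax <= region.xMax &&
      w.yMin >= region.yMin &&
      w.yMax <= region.yMax
  );
}

// Sample page; width and height are made-up values for illustration.
const page = {
  width: 595,
  height: 842,
  words: [
    { xMin: 103.2, xMax: 348.29439, yMin: 547.3557, yMax: 561.32172, text: '(6)綠線 G01 站延伸至大溪、龍潭先進公' },
    { xMin: 124.68, xMax: 320.813062, yMin: 572.3757, yMax: 586.34172, text: '共運輸系統發展委託可行性研究' },
  ],
};

const hits = wordsInRegion(page, { xMin: 100, xMax: 350, yMin: 540, yMax: 565 });
console.log(hits.map((w) => w.text));
console.log(ptToMm(page.width).toFixed(1), 'mm wide');
```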

Keywords

pdftotext

Package last updated on 15 Jul 2015
