
pdftotext wrapper that generates JSON with bounding box data. Takes care of duplicate characters.
pdftojson is a pdftotext wrapper that generates JSON with bounding box data. It takes care of overlapping, duplicated characters, which often exist in PDF files generated by MS Word that contain floating images and text.
pdftotext?
Consider a PDF file named theFile.pdf. Running pdftotext -bbox theFile.pdf on it would generate this:
...
<word xMin="103.320000" yMin="547.355700" xMax="152.368008" yMax="561.321720">(6)綠線</word>
<word xMin="155.880000" yMin="547.355700" xMax="176.846541" yMax="561.321720">G01</word>
<word xMin="155.880000" yMin="547.355700" xMax="162.867200" yMax="561.321720">G</word>
<word xMin="180.300000" yMin="547.355700" xMax="222.295867" yMax="561.321720">站延伸</word>
<word xMin="208.080000" yMin="547.355700" xMax="264.053062" yMax="561.321720">伸至大溪</word>
<word xMin="264.480000" yMin="547.355700" xMax="334.420485" yMax="561.321720">、龍潭先進</word>
<word xMin="320.340000" yMin="547.355700" xMax="348.294390" yMax="561.321720">進公</word>
<word xMin="124.680000" yMin="572.375700" xMax="166.675867" yMax="586.341720">共運輸</word>
<word xMin="152.700000" yMin="572.375700" xMax="222.644667" yMax="586.341720">輸系統發展</word>
<word xMin="208.440000" yMin="572.375700" xMax="278.395867" yMax="586.341720">展委託可行</word>
<word xMin="264.840000" yMin="572.375700" xMax="320.813062" yMax="586.341720">行性研究</word>
...
pdftotext does a great job of "undoing" the physical layout (columns, hyphenation, etc.) of a PDF document. However, its output contains some overlapping and duplicate words. In the output above, for instance, "G" duplicates the start of "G01", and "站延伸" overlaps "伸至大溪". PDF layout engines sometimes generate these quirks when images and text are mixed within a page.
On the other hand, pdftojson theFile.pdf could generate this:
...
{
"xMin": 103.2,
"xMax": 348.29439,
"yMin": 547.3557,
"yMax": 561.32172,
"text": "(6)綠線 G01 站延伸至大溪、龍潭先進公"
},
{
"xMin": 124.68,
"xMax": 320.813062,
"yMin": 572.3757,
"yMax": 586.34172,
"text": "共運輸系統發展委託可行性研究"
}
...
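The merging illustrated by the two outputs above can be sketched roughly as follows. This is only one plausible approach, not pdftojson's actual implementation: the function name mergeLine and the 1 pt gap threshold are assumptions made for this illustration, and the input is assumed to be the words of a single line, each carrying the xMin, xMax, and text values from the pdftotext -bbox output.

```javascript
// Hypothetical sketch: merge the words of ONE line, dropping duplicates
// and overlapping characters. Not pdftojson's real algorithm.
function mergeLine(words, gapThreshold = 1) {
  // Left to right; for equal xMin, the wider word comes first.
  const sorted = [...words].sort(
    (a, b) => a.xMin - b.xMin || b.xMax - a.xMax
  );
  let text = '';
  let xMax = -Infinity;
  for (const w of sorted) {
    // Fully covered by what we already have: treat as a duplicate.
    if (w.xMax <= xMax) continue;
    if (w.xMin < xMax) {
      // Partial overlap: drop the longest prefix of this word that
      // repeats the end of the text collected so far.
      let k = Math.min(text.length, w.text.length);
      while (k > 0 && !text.endsWith(w.text.slice(0, k))) k--;
      text += w.text.slice(k);
    } else {
      // No overlap: insert a space only if the horizontal gap is
      // larger than the (assumed) threshold.
      const needsSpace = text !== '' && w.xMin - xMax > gapThreshold;
      text += (needsSpace ? ' ' : '') + w.text;
    }
    xMax = Math.max(xMax, w.xMax);
  }
  return text;
}

// The first line of words from the pdftotext -bbox output above.
const firstLine = [
  { xMin: 103.32, xMax: 152.368008, text: '(6)綠線' },
  { xMin: 155.88, xMax: 176.846541, text: 'G01' },
  { xMin: 155.88, xMax: 162.8672, text: 'G' },
  { xMin: 180.3, xMax: 222.295867, text: '站延伸' },
  { xMin: 208.08, xMax: 264.053062, text: '伸至大溪' },
  { xMin: 264.48, xMax: 334.420485, text: '、龍潭先進' },
  { xMin: 320.34, xMax: 348.29439, text: '進公' },
];

console.log(mergeLine(firstLine));
// → "(6)綠線 G01 站延伸至大溪、龍潭先進公"
```

With this sample the sketch reproduces the text of the first entry in the pdftojson output. A real implementation would also need to group words into lines by their yMin and yMax values, which is left out here.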
$ npm install pdftojson
pdftojson uses pdftotext. Please make sure pdftotext is available in PATH.
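One way to check this requirement from Node.js before calling pdftojson is sketched below. The helper name hasPdftotext is hypothetical and is not part of pdftojson; the sketch assumes only that pdftotext accepts the -v flag, which prints version information.

```javascript
const { spawnSync } = require('node:child_process');

// Hypothetical helper: returns true if a pdftotext executable can be
// started from PATH, false otherwise.
function hasPdftotext() {
  // "pdftotext -v" only prints version information and exits.
  const result = spawnSync('pdftotext', ['-v'], { stdio: 'ignore' });
  // When the executable cannot be found, spawnSync does not throw;
  // it sets result.error (typically with code "ENOENT").
  return !result.error;
}

if (!hasPdftotext()) {
  console.error('pdftotext was not found in PATH.');
}
```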
pdftojson is available as a command-line tool and a Node.js library.
# outputs some.json
$ pdftojson some.pdf
# converts pages 3 to 6 of some.pdf and outputs to some.json
$ pdftojson -c "-f 3 -l 6" some.pdf
The library exposes a single function that takes the name of a PDF file and returns a promise.
import pdftojson from 'pdftojson';
pdftojson("./some.pdf").then((output) => {
  // output is a JavaScript object.
});
All numeric values are in points (pt).
[
  { // Page
    width: (Number) page width,
    height: (Number) page height,
    words: [
      {
        text: (String) the text enclosed in the bounding box,
        // All coordinates calculated from top-left corner of the page
        xMin: (Number) left edge of the bounding box,
        xMax: (Number) right edge of the bounding box,
        yMin: (Number) top edge of the bounding box,
        yMax: (Number) bottom edge of the bounding box
      }, // ...
    ]
  }, // ...
]
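As an illustration of working with this structure, the sketch below picks out the text of all words that lie entirely inside a rectangular region of a page. The function name wordsInRegion and the sample page size (595 x 842 pt) are hypothetical; the word entries reuse the values from the pdftojson example output shown earlier.

```javascript
// Hypothetical helper: texts of the words on one page whose bounding
// boxes lie fully inside `region` (coordinates in pt, top-left origin).
function wordsInRegion(pages, pageIndex, region) {
  const page = pages[pageIndex];
  if (!page) return [];
  return page.words
    .filter(
      (w) =>
        w.xMin >= region.xMin &&
        w.xMax <= region.xMax &&
        w.yMin >= region.yMin &&
        w.yMax <= region.yMax
    )
    .map((w) => w.text);
}

// Sample data in the documented format; width and height are assumed.
const samplePages = [
  {
    width: 595,
    height: 842,
    words: [
      {
        xMin: 103.2, xMax: 348.29439, yMin: 547.3557, yMax: 561.32172,
        text: '(6)綠線 G01 站延伸至大溪、龍潭先進公',
      },
      {
        xMin: 124.68, xMax: 320.813062, yMin: 572.3757, yMax: 586.34172,
        text: '共運輸系統發展委託可行性研究',
      },
    ],
  },
];

// Only the first entry fits inside this band of the page.
console.log(
  wordsInRegion(samplePages, 0, { xMin: 100, xMax: 350, yMin: 540, yMax: 565 })
);
```

In real use, the pages argument would be the output that the promise returned by pdftojson resolves to.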