pdf.js-extract

A super-simple async PDF reader, based on pdf.js, that extracts text with x,y page positions.

Version: 0.0.9 (npm)

Version published: 7 years ago

Weekly downloads: 62K (up 3.35%)

Maintainers: 1

Created: 8 years ago

pdf.js-extract

Extracts text from PDF files.

This library is packaged from the pdf.js examples for using pdf.js with Node.js.

It reads a PDF file and exports all pages and text items with their coordinates. This can be used, for example, to extract structured table data.

This package includes a build of pdf.js. Why? Because pdfjs-dist installs unneeded dependencies into production deployments.
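As a rough illustration of the table use case, the sketch below groups text items into rows by their y coordinate and orders each row by x. It is not part of the package: `groupIntoRows` and its `tolerance` parameter are hypothetical names, and the sample items are made up in the shape of the `content` entries the library returns.

```javascript
// Group text items into rows by their y coordinate (a simple heuristic).
// `tolerance` is a hypothetical parameter: an item joins the current row
// when its y differs from the row's first item by at most this much.
function groupIntoRows(content, tolerance = 2) {
  // Sort by y first, then x, without mutating the caller's array
  const sorted = [...content].sort((a, b) => a.y - b.y || a.x - b.x);
  const rows = [];
  for (const item of sorted) {
    const last = rows[rows.length - 1];
    if (last && Math.abs(last.y - item.y) <= tolerance) {
      last.items.push(item);
    } else {
      rows.push({ y: item.y, items: [item] });
    }
  }
  // Within each row, order cells left to right and keep only the text
  return rows.map((row) =>
    row.items.sort((a, b) => a.x - b.x).map((item) => item.str)
  );
}

// Made-up sample items; real ones come from data.pages[n].content
const sample = [
  { x: 120, y: 50, str: 'Qty' },
  { x: 20, y: 50, str: 'Item' },
  { x: 20, y: 70.4, str: 'Apple' },
  { x: 120, y: 70, str: '3' },
];
console.log(groupIntoRows(sample));
```

Real PDFs often need more than this, for example column detection by x ranges, so treat the tolerance-based grouping as a starting point only.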

Note: NO OCR!

Install

    npm install pdf.js-extract

Convenience API


    const PDFExtract = require('pdf.js-extract').PDFExtract;
    const pdfExtract = new PDFExtract();
    const filename = 'helloworld.pdf'; // path to the PDF file to read
    // Options are handed over to pdf.js, e.g. { password: 'somepassword' }
    const options = {};
    pdfExtract.extract(filename, options, function (err, data) {
        if (err) return console.error(err);
        console.log(data);
    });
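If you prefer promises over callbacks, the callback-style call shown above can be wrapped by hand. This is a minimal sketch, assuming only the `extract(filename, options, callback)` signature documented here; `extractAsync` is a hypothetical helper, not something the package provides.

```javascript
// Wrap the callback-style extract(filename, options, cb) call in a Promise.
// `extractAsync` is a hypothetical helper, not part of pdf.js-extract.
function extractAsync(pdfExtract, filename, options = {}) {
  return new Promise((resolve, reject) => {
    pdfExtract.extract(filename, options, (err, data) => {
      if (err) reject(err);
      else resolve(data);
    });
  });
}

// Usage (assumes the package is installed and helloworld.pdf exists):
// const { PDFExtract } = require('pdf.js-extract');
// extractAsync(new PDFExtract(), 'helloworld.pdf')
//   .then((data) => console.log(data.pages.length))
//   .catch((err) => console.error(err));
```

Passing the `PDFExtract` instance in as an argument keeps the helper independent of the package, which also makes it easy to exercise with a stub.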

Example Output

{
	"filename": "helloworld.pdf",
	"meta": {
		"info": {
			"PDFFormatVersion": "1.7",
			"IsAcroFormPresent": false,
			"IsXFAPresent": false
		},
		"metadata": null
	},
	"pages": [
		{
			"pageInfo": {
				"num": 1,
				"scale": 1,
				"rotation": 0,
				"offsetX": 0,
				"offsetY": 0,
				"width": 200,
				"height": 200,
				"fontScale": 1
			},
			"content": [
				{
					"x": 70,
					"y": 150,
					"str": "Hello, world!",
					"dir": "ltr",
					"width": 64.656,
					"height": 12,
					"fontName": "Times"
				}
			]
		}
	],
	"pdfInfo": {
		"numPages": 1,
		"fingerprint": "1ee9219eb9eaa49acbfc20155ac359c3"
	}
}
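To show how this structure might be consumed, here is a small sketch that joins the text items of each page into one string. `pageTexts` is a hypothetical helper name; it relies only on the `pages[].pageInfo.num` and `pages[].content[].str` fields visible in the example output, and the inline `data` object is a trimmed copy of that output.

```javascript
// Collect the text of each page from an extract() result.
// `pageTexts` is a hypothetical helper, not part of the package.
function pageTexts(data) {
  return data.pages.map((page) => ({
    num: page.pageInfo.num,
    // Join items in the order they appear; no layout analysis is attempted
    text: page.content.map((item) => item.str).join(' '),
  }));
}

// Trimmed copy of the example output
const data = {
  pages: [
    {
      pageInfo: { num: 1, width: 200, height: 200 },
      content: [{ x: 70, y: 150, str: 'Hello, world!' }],
    },
  ],
};
console.log(pageTexts(data));
```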

TODO

Document the utilities for table parsing.


Package last updated on 23 Feb 2018
