pdfdataextract

Extract data from a pdf with pure javascript

Version: 4.0.0 (latest)
Source: npm
Version published: 4 months ago
Weekly downloads: 4.5K
Maintainers: 1
Created: 4 years ago

pdfdataextract

Extract data from a pdf with pure javascript.

The PdfData wrapper around PdfDataExtractor is inspired by https://www.npmjs.com/package/pdf-parse, which is currently unmaintained. PdfDataExtractor itself is a simple interface for extracting individual pieces of data from a PDF file.

Install

npm install pdfdataextract

Docs

Full documentation is available in the project wiki.

Usage

PdfData is a wrapper around PdfDataExtractor that returns a complete JSON structure in a single call.

import { PdfData, VerbosityLevel } from 'pdfdataextract';
import { readFileSync } from 'fs';
const file_data = readFileSync('some_pdf_file.pdf');

// all options are optional
PdfData.extract(file_data, {
	password: '123456', // password of the pdf file
	pages: 1, // how many pages should be read at most
	sort: true, // sort the text by text coordinates
	verbosity: VerbosityLevel.ERRORS, // set the verbosity level for parsing
	get: { // enable or disable data extraction (all are optional and enabled by default)
		pages: true, // get number of pages
		text: true, // get text of each page
		fingerprint: true, // get fingerprint
		outline: true, // get outline
		metadata: true, // get metadata
		info: true, // get info
		permissions: true, // get permissions
	},
}).then((data) => {
	data.pages; // the number of pages
	data.text; // an array of text pages
	data.fingerprint; // fingerprint of the pdf document
	data.outline; // outline data of the pdf document
	data.info; // information of the pdf document, such as Author
	data.metadata; // metadata of the pdf document
	data.permissions; // permissions for the document
});
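Because every switch under `get` is enabled by default, extracting only a subset means setting the others to false explicitly. The following is a minimal sketch of a helper that builds such an object. `onlyGet`, `GET_KEYS`, `GetKey` and `GetOptions` are hypothetical names introduced here for illustration and are not part of pdfdataextract; the seven keys are the ones listed in the example above.

```typescript
// The seven extraction switches shown in the `get` option above.
const GET_KEYS = ['pages', 'text', 'fingerprint', 'outline', 'metadata', 'info', 'permissions'] as const;
type GetKey = typeof GET_KEYS[number];
type GetOptions = Record<GetKey, boolean>;

// Hypothetical helper (not part of pdfdataextract): enable only the named
// switches and explicitly disable all the others.
function onlyGet(...wanted: GetKey[]): GetOptions {
	const get = {} as GetOptions;
	for (const key of GET_KEYS) {
		get[key] = wanted.indexOf(key) !== -1;
	}
	return get;
}
```

Assuming the `get` option accepts this shape as shown in the example above, the result could be passed as `PdfData.extract(file_data, { get: onlyGet('text', 'pages') })`.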

PdfDataExtractor is the underlying interface for extracting individual pieces of data:

import { PdfDataExtractor, VerbosityLevel } from 'pdfdataextract';
import { readFileSync } from 'fs';
const file_data = readFileSync('some_pdf_file.pdf');

// all options are optional
PdfDataExtractor.get(file_data, {
	password: '123456', // password of the pdf file
	verbosity: VerbosityLevel.ERRORS, // set the verbosity level for parsing
}).then(async (extractor) => {
	extractor.pages; // the number of pages
	extractor.fingerprint; // fingerprint of the pdf document

	// an array of text pages (only one page and sorted)
	const first_page = await extractor.getText(1, true);

	// an array of text pages (only the second page)
	const second_page = await extractor.getText([2]);

	// outline data of the pdf document
	const outline = await extractor.getOutline();

	// metadata of the pdf document
	const metadata = await extractor.getMetadata();

	// permissions for the document
	const permissions = await extractor.getPermissions();

	// close the extractor only after all pending extractions have finished
	extractor.close();
});
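The example above passes an array to `getText` to select a specific page (`[2]` for the second page, so page numbers are 1-based). A small sketch of a helper for building a contiguous list of page numbers follows. `pageRange` is a hypothetical name and not part of pdfdataextract, and the sketch assumes the array form accepts more than one page number, which the example above does not show.

```typescript
// Hypothetical helper (not part of pdfdataextract): build an inclusive list
// of 1-based page numbers between `first` and `last`.
function pageRange(first: number, last: number): number[] {
	// Reject non-integer, non-positive, or reversed ranges (NaN fails the floor check).
	if (Math.floor(first) !== first || Math.floor(last) !== last || first < 1 || last < first) {
		throw new RangeError(`invalid page range: ${first}..${last}`);
	}
	const pages: number[] = [];
	for (let page = first; page <= last; page++) {
		pages.push(page);
	}
	return pages;
}
```

Under that assumption, `extractor.getText(pageRange(2, 4))` would request the second through fourth pages.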

Test

npm test

License

MIT licensed


Package last updated on 29 Mar 2025
