pdf-text-extract
Extract text from PDFs that contain searchable text. The module is a wrapper that calls the pdftotext command to perform the actual extraction.
npm install --save pdf-text-extract
You will need the pdftotext binary available on your path. There are packages available for many different operating systems.
See https://github.com/nisaacson/pdf-extract#osx for how to install the pdftotext command.
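Before calling the module, it can help to confirm that the binary really is reachable. The following is a minimal sketch of such a check, not part of pdf-text-extract itself: the helper name findOnPath is hypothetical, and it simply walks the directories of a PATH-style string looking for an executable file.

```javascript
var fs = require('fs')
var path = require('path')

// Hypothetical helper (not part of pdf-text-extract): search each directory
// in a PATH-style string for an executable called `name`.
// Returns the full path of the first match, or null if none is found.
function findOnPath (name, envPath) {
  var dirs = (envPath || '').split(path.delimiter).filter(Boolean)
  // On Windows, executables usually carry an extension such as .exe
  var exts = process.platform === 'win32' ? ['.exe', '.cmd', '.bat', ''] : ['']
  for (var i = 0; i < dirs.length; i++) {
    for (var j = 0; j < exts.length; j++) {
      var candidate = path.join(dirs[i], name + exts[j])
      try {
        fs.accessSync(candidate, fs.constants.X_OK)
        if (fs.statSync(candidate).isFile()) return candidate
      } catch (err) {
        // not present or not executable in this directory; keep looking
      }
    }
  }
  return null
}

var found = findOnPath('pdftotext', process.env.PATH)
console.log(found ? 'pdftotext found at ' + found : 'pdftotext not found on PATH')
```

If the check reports that the binary is missing, install it with your operating system's package manager before using the module.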
extract(filePath, [options], [pdftotextcommand], callback)
Options and pdftotextcommand are not required.
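Because the two middle arguments are optional, the callback may arrive in the second, third or fourth position. The sketch below shows one common way such a signature can be resolved. It is an illustration only: the function name resolveArguments is hypothetical, and the default command name 'pdftotext' is an assumption rather than something confirmed by the module's source.

```javascript
// Hypothetical sketch: work out which optional arguments were supplied.
// Assumes the default command is 'pdftotext' when none is given.
function resolveArguments (filePath, options, pdfToTextCommand, callback) {
  if (typeof options === 'function') {
    // called as extract(filePath, callback)
    callback = options
    options = {}
    pdfToTextCommand = 'pdftotext'
  } else if (typeof pdfToTextCommand === 'function') {
    // called as extract(filePath, options, callback)
    callback = pdfToTextCommand
    pdfToTextCommand = 'pdftotext'
  }
  return {
    filePath: filePath,
    options: options || {},
    pdfToTextCommand: pdfToTextCommand || 'pdftotext',
    callback: callback
  }
}

function done () {}
console.dir(resolveArguments('multipage.pdf', done))
console.dir(resolveArguments('multipage.pdf', { splitPages: false }, done))
```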
var path = require('path')
var filePath = path.join(__dirname, 'test/data/multipage.pdf')
var extract = require('pdf-text-extract')
extract(filePath, function (err, pages) {
  if (err) {
    console.dir(err)
    return
  }
  console.dir(pages)
})
The output will be an array where each entry is a page of text. If you want a single string containing all pages, set the option splitPages: false.
var path = require('path')
var filePath = path.join(__dirname, 'test/data/multipage.pdf')
var extract = require('pdf-text-extract')
extract(filePath, { splitPages: false }, function (err, text) {
  if (err) {
    console.dir(err)
    return
  }
  console.dir(text)
})
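To illustrate the difference between the two output shapes, here is a small sketch of how raw text can be divided into pages. It relies on the fact that pdftotext normally ends each page with a form feed character ('\f'). The helper name splitIntoPages is hypothetical, and this is not necessarily how the module does it internally.

```javascript
// Hypothetical helper: split raw pdftotext output into an array of pages.
// Assumes pages are terminated by form feed characters ('\f').
function splitIntoPages (text) {
  var pages = text.split('\f')
  // A trailing form feed leaves an empty final entry; drop it
  if (pages.length > 1 && pages[pages.length - 1] === '') {
    pages.pop()
  }
  return pages
}

var raw = 'first page\n\fsecond page\n\f'
var pages = splitIntoPages(raw)
console.dir(pages) // one entry per page
console.dir(pages.join('')) // all pages as a single string
```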
You can set the following options:
- firstPage: First page to extract
- lastPage: Last page to extract
- resolution: Resolution in dpi, as specified by pdftotext -r
- crop: Should be an object { x:x, y:y, w:w, h:h }
- layout: Should be either layout, raw or htmlmeta. Default: layout
- encoding: Should be either UCS-2, ASCII7, Latin1, UTF-8, ZapfDingbats or Symbol. Default: UTF-8
- eol: End of line convention. One of: unix, dos or mac
- ownerPassword: Owner password (for encrypted files)
- userPassword: User password (for encrypted files)
- splitPages: If true, the result will be an array of pages. Default: true.

If needed, you can pass optional arguments to the extract function. These will be passed to the child_process.spawn call.
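var path = require('path')
var filePath = path.join(__dirname, 'test/data/multipage.pdf')
var extract = require('pdf-text-extract')
var options = {
  cwd: "./"
}
extract(filePath, options, function (err, pages) {
  if (err) {
    console.dir(err)
    return
  }
  // console.log prints both the label and the pages;
  // console.dir would treat its second argument as inspect options
  console.log('extracted pages', pages)
})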
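The extraction options listed earlier correspond to command-line flags of pdftotext. As a rough illustration of that mapping, the sketch below turns an options object into an argument list. The helper name buildArgs is hypothetical and the module's real implementation may differ. The flags themselves (-f, -l, -r, -x, -y, -W, -H, -layout, -raw, -htmlmeta, -enc, -eol, -opw, -upw) are standard pdftotext options.

```javascript
// Hypothetical helper: translate an options object into pdftotext arguments.
// This is an illustration, not the module's actual implementation.
function buildArgs (filePath, options) {
  options = options || {}
  var args = []
  if (options.firstPage != null) args.push('-f', String(options.firstPage))
  if (options.lastPage != null) args.push('-l', String(options.lastPage))
  if (options.resolution != null) args.push('-r', String(options.resolution))
  if (options.crop) {
    args.push('-x', String(options.crop.x), '-y', String(options.crop.y),
      '-W', String(options.crop.w), '-H', String(options.crop.h))
  }
  // layout is 'layout', 'raw' or 'htmlmeta'; each is a flag of the same name
  args.push('-' + (options.layout || 'layout'))
  args.push('-enc', options.encoding || 'UTF-8')
  if (options.eol) args.push('-eol', options.eol)
  if (options.ownerPassword) args.push('-opw', options.ownerPassword)
  if (options.userPassword) args.push('-upw', options.userPassword)
  // input file first, then '-' so the extracted text is written to stdout
  args.push(filePath, '-')
  return args
}

console.dir(buildArgs('multipage.pdf', { firstPage: 2, lastPage: 3, eol: 'unix' }))
```

An argument list built this way could be handed to child_process.spawn together with the command name and any spawn options such as cwd.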
You can also override the command for pdftotext if it is installed in a location that is not available in the PATH environment variable.
var path = require('path')
var filePath = path.join(__dirname, 'test/data/multipage.pdf')
var pdfToTextCommand = '/opt/bin/pdftotext'
var extract = require('pdf-text-extract')
var options = {
  cwd: "./"
}
extract(filePath, options, pdfToTextCommand, function (err, pages) {
  if (err) {
    console.dir(err)
    return
  }
  // console.log prints both the label and the pages
  console.log('extracted pages', pages)
})
npm install -g pdf-text-extract
Execute with the filePath as an argument. The output will be a JSON-formatted array of pages.
pdf-text-extract ./test/data/multipage.pdf
# outputs
# ['<page 1 content...>', '<page 2 content...>']
# install dev dependencies
npm install
# run tests
npm test
FAQs
Extract text from pdfs that contain searchable pdf text
We found that pdf-text-extract has an unhealthy version release cadence and level of project activity, because the last version was released a year ago. It has 2 open source maintainers collaborating on the project.