pdf-text-extract
Extract text from PDFs that contain searchable text. The module is a wrapper that calls the pdftotext command to perform the actual extraction.
npm install --save pdf-text-extract
You will need the pdftotext binary available on your path. Packages are available for many operating systems.
See https://github.com/nisaacson/pdf-extract#osx for how to install the pdftotext command.
extract(filePath, [options], [pdftotextcommand], callback)
The options and pdftotextcommand arguments are optional.
var path = require('path')
var filePath = path.join(__dirname, 'test/data/multipage.pdf')
var extract = require('pdf-text-extract')
extract(filePath, function (err, pages) {
  if (err) {
    console.dir(err)
    return
  }
  console.dir(pages)
})
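For promise-based code, the callback form above can be wrapped with Node's built-in util.promisify. This is a minimal sketch that assumes extract always takes its Node-style callback (err, result) as the last argument, as the examples here show. toPromise is a hypothetical helper name and is not part of pdf-text-extract.

```javascript
var util = require('util')

// Hypothetical helper (not part of pdf-text-extract): wraps any function that
// takes a Node-style callback as its last argument so it returns a Promise.
function toPromise (extract) {
  return util.promisify(extract)
}

// Usage, assuming pdf-text-extract and the pdftotext binary are installed:
//   var extractAsync = toPromise(require('pdf-text-extract'))
//   extractAsync(filePath).then(function (pages) {
//     console.dir(pages)
//   }).catch(function (err) {
//     console.dir(err)
//   })
```

Because the callback is appended as the final argument, the same wrapper should also accept the optional options and command arguments described in this README.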
The output will be an array in which each entry is a page of text. If you want a single string containing all pages, set the option splitPages: false.
var path = require('path')
var filePath = path.join(__dirname, 'test/data/multipage.pdf')
var extract = require('pdf-text-extract')
extract(filePath, { splitPages: false }, function (err, text) {
  if (err) {
    console.dir(err)
    return
  }
  console.dir(text)
})
You can set the following options:
firstPage: First page to extract
lastPage: Last page to extract
resolution: Resolution in dpi, as specified by pdftotext -r
crop: Should be an object { x:x, y:y, w:w, h:h }
layout: Should be either layout, raw or htmlmeta. Default: layout
encoding: Should be either UCS-2, ASCII7, Latin1, UTF-8, ZapfDingbats or Symbol. Default: UTF-8
eol: End-of-line convention. One of unix, dos or mac
ownerPassword: Owner password (for encrypted files)
userPassword: User password (for encrypted files)
splitPages: If true, the result will be an array of pages. Default: true
If needed, you can pass additional options to the extract function. These will be passed to the child_process.spawn call.
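var path = require('path')
var filePath = path.join(__dirname, 'test/data/multipage.pdf')
var extract = require('pdf-text-extract')
var options = {
  cwd: './'
}
extract(filePath, options, function (err, pages) {
  if (err) {
    console.dir(err)
    return
  }
  // console.dir treats its second argument as display options, so use console.log to print a label and the pages
  console.log('extracted pages', pages)
})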
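Returning to the extraction options listed earlier (firstPage, lastPage, resolution, crop, layout, encoding, eol and the passwords), the sketch below shows one plausible way they could translate into pdftotext command-line flags. It is an illustration only: buildPdftotextArgs is a hypothetical helper, and the module's real argument handling may differ. The flags themselves (-f, -l, -r, -x, -y, -W, -H, -layout, -raw, -htmlmeta, -enc, -eol, -opw, -upw) are standard pdftotext options.

```javascript
// Hypothetical helper: NOT part of pdf-text-extract. It illustrates how the
// documented options could map onto pdftotext command-line flags.
function buildPdftotextArgs (filePath, options) {
  options = options || {}
  var args = []
  if (options.firstPage != null) args.push('-f', String(options.firstPage))
  if (options.lastPage != null) args.push('-l', String(options.lastPage))
  if (options.resolution != null) args.push('-r', String(options.resolution))
  if (options.crop) {
    // pdftotext takes the crop box as -x, -y (top-left corner) and -W, -H (size)
    args.push('-x', String(options.crop.x), '-y', String(options.crop.y),
      '-W', String(options.crop.w), '-H', String(options.crop.h))
  }
  // layout, raw and htmlmeta are plain switches; layout is the documented default
  args.push('-' + (options.layout || 'layout'))
  args.push('-enc', options.encoding || 'UTF-8')
  if (options.eol) args.push('-eol', options.eol)
  if (options.ownerPassword) args.push('-opw', options.ownerPassword)
  if (options.userPassword) args.push('-upw', options.userPassword)
  // Input file, then "-" so that pdftotext writes the text to stdout
  args.push(filePath, '-')
  return args
}

console.log(buildPdftotextArgs('multipage.pdf', { firstPage: 2, lastPage: 3 }))
// prints [ '-f', '2', '-l', '3', '-layout', '-enc', 'UTF-8', 'multipage.pdf', '-' ]
```

Seeing the flags spelled out can help when comparing the module's output against a manual pdftotext run with the same settings.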
You can also override the pdftotext command if it is installed in a location that is not on the PATH environment variable.
var path = require('path')
var filePath = path.join(__dirname, 'test/data/multipage.pdf')
var pdfToTextCommand = '/opt/bin/pdftotext'
var extract = require('pdf-text-extract')
var options = {
  cwd: './'
}
extract(filePath, options, pdfToTextCommand, function (err, pages) {
  if (err) {
    console.dir(err)
    return
  }
  // console.dir treats its second argument as display options, so use console.log to print a label and the pages
  console.log('extracted pages', pages)
})
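Rather than hard-coding the path, the command could be read from the environment. The sketch below assumes a hypothetical PDFTOTEXT_PATH variable, which pdf-text-extract does not read itself. The name and the resolvePdftotextCommand helper are illustrative only.

```javascript
// Hypothetical helper: PDFTOTEXT_PATH is an illustrative variable name, not
// something pdf-text-extract reads on its own.
function resolvePdftotextCommand (env) {
  var configured = env.PDFTOTEXT_PATH
  // Fall back to the bare command name so the normal PATH lookup still applies
  return configured && configured.trim() ? configured.trim() : 'pdftotext'
}

var pdfToTextCommand = resolvePdftotextCommand(process.env)
// Pass the result as the third argument:
//   extract(filePath, options, pdfToTextCommand, callback)
console.log(pdfToTextCommand)
```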
npm install -g pdf-text-extract
Execute with the filePath as an argument. The output will be a JSON-formatted array of pages.
pdf-text-extract ./test/data/multipage.pdf
# outputs
# ['<page 1 content...>', '<page 2 content...>']
# install dev dependencies
npm install
# run tests
npm test