pdf-extract
Node PDF is a set of tools that takes in PDF files and converts them to usable formats for data processing. The library supports both extracting text from searchable PDF files and performing OCR on PDFs that are just scanned images of text.
To begin, install the module.
npm install pdf-extract
After the library is installed, you will need the following binaries accessible on your PATH to process PDFs: pdftk, pdftotext, ghostscript, and tesseract.
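Once the binaries are installed using the platform instructions that follow, a quick check like the one below can confirm that they are visible on the PATH. This is a minimal sketch, not part of pdf-extract: `isOnPath` and `findMissingBinaries` are hypothetical helper names, and the binary list assumes ghostscript is reachable as `gs`.

```javascript
const fs = require('fs');
const path = require('path');

// Binaries the README says must be reachable on the PATH.
// Assumption: ghostscript is invoked as "gs".
const REQUIRED_BINARIES = ['pdftk', 'pdftotext', 'gs', 'tesseract'];

// Return true if an executable file called `name` exists in any PATH directory.
function isOnPath(name, envPath = process.env.PATH || '') {
  // On Windows, executables usually carry an extension such as .exe.
  const extensions = process.platform === 'win32' ? ['.exe', '.cmd', '.bat', ''] : [''];
  return envPath.split(path.delimiter).filter(Boolean).some((dir) =>
    extensions.some((ext) => {
      const candidate = path.join(dir, name + ext);
      try {
        if (!fs.statSync(candidate).isFile()) return false;
        fs.accessSync(candidate, fs.constants.X_OK);
        return true;
      } catch (e) {
        return false; // not present or not executable in this directory
      }
    })
  );
}

// Return the subset of `names` that could not be found on the PATH.
function findMissingBinaries(names = REQUIRED_BINARIES, envPath = process.env.PATH || '') {
  return names.filter((name) => !isOnPath(name, envPath));
}

console.log('missing binaries:', findMissingBinaries());
```

An empty list means all four binaries were found; any names printed still need to be installed or added to the PATH.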
To begin on OSX, first make sure you have the homebrew package manager installed.
pdftk is not available in Homebrew. However, a GUI installer is available here: http://www.pdflabs.com/docs/install-pdftk/
pdftotext is included as part of the poppler utilities library. poppler can be installed via Homebrew.
brew install poppler
ghostscript can be installed via Homebrew
brew install gs
tesseract can be installed via homebrew as well
brew install tesseract
After tesseract is installed, you need to install the alphanumeric config and the updated trained data files
cd <root of this module>
cp "./share/eng.traineddata" "/usr/local/Cellar/tesseract/3.02.02_3/share/tessdata/eng.traineddata"
cp "./share/dia.traineddata" "/usr/local/Cellar/tesseract/3.02.02_3/share/tessdata/dia.traineddata"
cp "./share/configs/alphanumeric" "/usr/local/Cellar/tesseract/3.02.02_3/share/tessdata/configs/alphanumeric"
On Ubuntu, pdftk can be installed directly via apt-get
apt-get install pdftk
pdftotext is included in the poppler-utils library. To install poppler-utils, execute
apt-get install poppler-utils
ghostscript can be installed via apt-get
apt-get install ghostscript
tesseract can be installed via apt-get. Note that, unlike the OSX install, the package is called tesseract-ocr on Ubuntu, not tesseract
apt-get install tesseract-ocr
For the OCR to work, you need to have the tesseract-ocr binaries available on your path. If you only need to handle ASCII characters, the accuracy of the OCR process can be increased by limiting the tesseract output. To do this, copy the alphanumeric file included with this pdf-extract module into the configs directory of the tessdata folder on your system. Also, the eng.traineddata included with the standard tesseract-ocr package is out of date. This pdf-extract module provides an up-to-date version, which you should copy into the appropriate location on your system
cd <root of this module>
cp "./share/eng.traineddata" "/usr/share/tesseract-ocr/tessdata/eng.traineddata"
cp "./share/configs/alphanumeric" "/usr/share/tesseract-ocr/tessdata/configs/alphanumeric"
pdftk can be installed directly via apt-get
apt-get install pdftk
pdftotext is included in the poppler-utils library. To install poppler-utils, execute
apt-get install poppler-utils
On SmartOS, ghostscript can be installed via pkgin. Note that you may need to update the pkgin repo to include the additional sources provided by Joyent. Check http://www.perkin.org.uk/posts/9000-packages-for-smartos-and-illumos.html for details
pkgin install ghostscript
tesseract must be manually downloaded and compiled. You must also install leptonica before installing tesseract. At the time of this writing, leptonica is available from http://www.leptonica.com/download.html, with the latest version tarball available from http://www.leptonica.com/source/leptonica-1.69.tar.gz
pkgin install autoconf
wget http://www.leptonica.com/source/leptonica-1.69.tar.gz
tar -xvzf leptonica-1.69.tar.gz
cd leptonica-1.69
./configure
make
[sudo] make install
After installing leptonica, move on to tesseract. Tesseract is available from https://code.google.com/p/tesseract-ocr/downloads/list with the latest version available from https://code.google.com/p/tesseract-ocr/downloads/detail?name=tesseract-ocr-3.02.02.tar.gz&can=2&q=
wget "https://code.google.com/p/tesseract-ocr/downloads/detail?name=tesseract-ocr-3.02.02.tar.gz&can=2&q="
tar -xvzf tesseract-ocr-3.02.02.tar.gz
cd tesseract-ocr
./configure
make
[sudo] make install
Important! You will have to add some variables to the PATH of your machine. You do this by right clicking your computer in file explorer, select Properties, select Advanced System Settings, Environment Variables. You can then add the folder that contains the executables to the path variable.
pdftk can be installed using the PDFtk Server installer found here: https://www.pdflabs.com/tools/pdftk-server/ It should automatically add itself to the PATH. If it does not, the default install location is "C:\Program Files (x86)\PDFtk Server\bin"
pdftotext can be installed using the recompiled poppler utils for windows, which have been collected and bundled here: http://manifestwebdesign.com/2013/01/09/xpdf-and-poppler-utils-on-windows/ Unpack these in a folder, (example: "C:\poppler-utils") and add this to the PATH.
ghostscript for Windows can be found at: http://www.ghostscript.com/download/gsdnld.html Make sure you download the General Public License and the correct version (32/64bit). Install it and go to the installation folder (default: "C:\Program Files\gs\gs9.19") and go into the bin folder. Rename the gswin64c to gs, and add the bin folder to your PATH.
tesseract can be built from source, but you can also download an older version which seems to work fine. Downloads are at: https://sourceforge.net/projects/tesseract-ocr-alt/files/ The version tested is tesseract-ocr-setup-3.02.02.exe. The default install location is "C:\Program Files (x86)\Tesseract-OCR", which is also added to the PATH. Note that this only happens when you have chosen to install it for everyone on the machine.
Everything should work after all this! If not, try restarting to make sure the PATH variables are correctly used. This setup was tested on a Windows 10 Pro N 64bit machine.
Extract from a pdf file which contains a scanned image and no searchable text
var inspect = require('eyes').inspector({maxLength:20000});
var pdf_extract = require('pdf-extract');
// Use a real absolute path here; Node does not expand '~'
var absolute_path_to_pdf = '~/Downloads/sample.pdf';
var options = {
  type: 'ocr' // perform ocr to get the text within the scanned image
};
// `callback` is assumed to be supplied by the surrounding function
var processor = pdf_extract(absolute_path_to_pdf, options, function(err) {
  if (err) {
    return callback(err);
  }
});
processor.on('complete', function(data) {
  inspect(data.text_pages, 'extracted text pages');
  callback(null, data.text_pages);
});
processor.on('error', function(err) {
  inspect(err, 'error while extracting pages');
  return callback(err);
});
Extract from a pdf file which contains actual searchable text
var inspect = require('eyes').inspector({maxLength:20000});
var pdf_extract = require('pdf-extract');
// Use a real absolute path here; Node does not expand '~'
var absolute_path_to_pdf = '~/Downloads/electronic.pdf';
var options = {
  type: 'text' // extract the actual text in the pdf file
};
// `callback` is assumed to be supplied by the surrounding function
var processor = pdf_extract(absolute_path_to_pdf, options, function(err) {
  if (err) {
    return callback(err);
  }
});
processor.on('complete', function(data) {
  inspect(data.text_pages, 'extracted text pages');
  callback(null, data.text_pages);
});
processor.on('error', function(err) {
  inspect(err, 'error while extracting pages');
  return callback(err);
});
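Both examples above rely on a `callback` from the surrounding code. If Promises are preferred, the same event flow can be wrapped as sketched below. `extractPages` is a hypothetical helper, not part of pdf-extract; it takes the extractor function as a parameter, and `fakeExtract` is a stand-in that only mimics the `complete` event so the sketch runs without the native binaries.

```javascript
const { EventEmitter } = require('events');

// Wrap an event-based extractor (such as the function exported by pdf-extract)
// in a Promise that resolves with the array of page texts.
function extractPages(pdfExtract, pdfPath, options) {
  return new Promise((resolve, reject) => {
    const processor = pdfExtract(pdfPath, options, (err) => {
      if (err) reject(err);
    });
    // Guard in case the extractor reports a setup error and returns nothing.
    if (processor && typeof processor.on === 'function') {
      processor.on('complete', (data) => resolve(data.text_pages));
      processor.on('error', reject);
    }
  });
}

// Stand-in for require('pdf-extract'), used only for demonstration.
function fakeExtract(pdfPath, options, cb) {
  const processor = new EventEmitter();
  setImmediate(() => {
    processor.emit('complete', { pdf_path: pdfPath, text_pages: ['page one', 'page two'] });
  });
  return processor;
}

extractPages(fakeExtract, '/tmp/sample.pdf', { type: 'text' })
  .then((pages) => console.log('extracted', pages.length, 'pages'))
  .catch((err) => console.error('extraction failed', err));
```

In real use, pass `require('pdf-extract')` in place of `fakeExtract`.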
At a minimum you must specify the type of pdf extract you wish to perform
clean When the system extracts text from a multi-page pdf, it first splits the pdf into single pages. These are written to disk before the OCR occurs. For some applications these single page files can be useful. If you need to work with the single page pdf files after the OCR is complete, set the clean option to false as shown below. Note that the single page pdf files are written to the system-appropriate temp directory, so you must copy the files to a more permanent location yourself after the OCR process completes
var options = {
  type: 'ocr', // (required), perform ocr to get the text within the scanned image
  clean: false, // keep the single page pdfs created during the ocr process
  ocr_flags: [
    '-psm 1', // automatically detect page orientation
    '-l dia', // use a custom language file
    'alphanumeric' // only output ascii characters
  ]
};
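When `clean` is set to false, the single page files stay in the temp directory, so they need to be copied somewhere permanent. A minimal sketch of that step follows; `keepSinglePagePdfs` is a hypothetical helper, and it assumes the `complete` event payload carries a `single_page_pdf_file_paths` array as described in the events section of this README.

```javascript
const fs = require('fs');
const path = require('path');

// Copy the single page PDFs listed in a 'complete' event payload into destDir.
// Returns the list of new file paths.
function keepSinglePagePdfs(data, destDir) {
  fs.mkdirSync(destDir, { recursive: true });
  return (data.single_page_pdf_file_paths || []).map((src) => {
    const dest = path.join(destDir, path.basename(src));
    fs.copyFileSync(src, dest);
    return dest;
  });
}

// Example wiring (destination path is a placeholder):
// processor.on('complete', function(data) {
//   keepSinglePagePdfs(data, '/path/to/permanent/location');
// });
```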
When processing, the module will emit various events as they occur
page Emitted when a page has completed processing. The data passed with this event looks like
var data = {
  hash: <sha1 hash of the input pdf file here>,
  text: <extracted text here>,
  index: 2,
  num_pages: 4,
  pdf_path: "~/Downloads/input_pdf_file.pdf",
  single_page_pdf_path: "/tmp/temp_pdf_file2.pdf"
}
error Emitted when an error occurs during processing. After this event is emitted processing will stop. The data passed with this event looks like
var data = {
error: 'no file exists at the path you specified',
pdf_path: "~/Downloads/input_pdf_file.pdf",
}
complete Emitted when all pages have completed processing and the pdf extraction is complete
var data = {
  hash: <sha1 hash of the input pdf file here>,
  text_pages: <Array of Strings, one per page>,
  pdf_path: "~/Downloads/input_pdf_file.pdf",
  single_page_pdf_file_paths: [
    "/tmp/temp_pdf_file1.pdf",
    "/tmp/temp_pdf_file2.pdf",
    "/tmp/temp_pdf_file3.pdf",
    "/tmp/temp_pdf_file4.pdf",
  ]
}
log To avoid spamming process.stdout, log events are emitted instead.
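The events above can be combined to report progress while a long document is processed. The sketch below is an illustration only: `trackProgress` is a hypothetical helper, it counts `page` events rather than relying on `index` (whose base is not stated here), and it forwards `log` payloads unchanged because their shape is not documented in this README. A plain `EventEmitter` stands in for the real processor.

```javascript
const { EventEmitter } = require('events');

// Attach listeners for 'page' and 'log' events on a processor.
// onProgress(pagesDone, totalPages) is called after each page completes.
function trackProgress(processor, onProgress, onLog) {
  let pagesDone = 0;
  processor.on('page', (data) => {
    pagesDone += 1;
    onProgress(pagesDone, data.num_pages);
  });
  if (typeof onLog === 'function') {
    processor.on('log', (entry) => onLog(entry));
  }
  return processor;
}

// Demonstration with a stand-in emitter instead of a real pdf-extract processor.
const demo = trackProgress(new EventEmitter(), (done, total) => {
  console.log('processed page ' + done + ' of ' + total);
});
demo.emit('page', { index: 0, num_pages: 2, text: 'first page' });
demo.emit('page', { index: 1, num_pages: 2, text: 'second page' });
```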
To test that your system satisfies the needed dependencies and that the module is functioning correctly, execute the following commands in the pdf-extract module folder
cd <project_root>/node_modules/pdf-extract
npm test