PDF to HTML or text conversion using Apache Tika. Also generates PDF thumbnails using Apache PDFBox.
pdf2html converts PDF files to HTML or plain text using Apache Tika. It can also generate a thumbnail image of a PDF page using Apache PDFBox.
via yarn:
yarn add pdf2html
via npm:
npm install --save pdf2html
A Java Runtime Environment (JRE) is required to run this module.
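Since a JRE is required, it can help to check for one before calling the module. The following is a minimal sketch, not part of pdf2html itself: `isJavaAvailable` is a hypothetical helper, and it assumes the JRE is exposed as a `java` executable on the PATH.

```javascript
const { spawnSync } = require('child_process');

// Hypothetical helper: returns true if `command -version` can be launched
// and exits successfully. Assumes the JRE is reachable through the PATH.
function isJavaAvailable(command = 'java') {
    const result = spawnSync(command, ['-version'], { stdio: 'ignore' });
    // spawnSync sets `error` (for example ENOENT) when the executable cannot be started.
    return !result.error && result.status === 0;
}

if (!isJavaAvailable()) {
    console.warn('No java executable found on PATH; pdf2html calls are likely to fail.');
}
```

Running this check once at application start-up gives a clearer message than a failure deep inside a conversion call.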
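const pdf2html = require('pdf2html');
// The calls below use await, so they must run inside an async function.
// Convert the whole document to HTML
const html = await pdf2html.html('sample.pdf');
console.log(html);
// Convert the whole document to plain text
const text = await pdf2html.text('sample.pdf');
console.log(text);
// Convert page by page, as HTML
const htmlPages = await pdf2html.pages('sample.pdf');
console.log(htmlPages);
// Convert page by page, as plain text
const pageOptions = { text: true };
const textPages = await pdf2html.pages('sample.pdf', pageOptions);
console.log(textPages);
// Extract the document metadata
const meta = await pdf2html.meta('sample.pdf');
console.log(meta);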
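The calls above rely on await, which in a CommonJS script has to sit inside an async function. The sketch below shows one way to wrap a call with error handling. `extractPageTexts` is a hypothetical helper, and the converter is passed in as a parameter (an assumption made here so the helper can be exercised without a real PDF); in practice it would be the pdf2html module.

```javascript
// Hypothetical helper: `converter` is expected to be the pdf2html module,
// or any object exposing the same pages(file, options) method.
async function extractPageTexts(converter, file) {
    try {
        // { text: true } requests plain text per page instead of HTML.
        return await converter.pages(file, { text: true });
    } catch (err) {
        // Re-throw with the file name so the failing document is easy to identify.
        throw new Error(`Failed to extract pages from ${file}: ${err.message}`);
    }
}

// Usage (assumes pdf2html is installed and sample.pdf exists):
// const pdf2html = require('pdf2html');
// extractPageTexts(pdf2html, 'sample.pdf').then(console.log).catch(console.error);
```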
The maxBuffer option specifies the largest number of bytes allowed on stdout or stderr. If this value is exceeded, then the child process is terminated.
By default, the maximum buffer size is 2MB. You can customize it by passing the maxBuffer option.
await pdf2html.meta('sample.pdf', { maxBuffer: 1024 * 10000 }); // set maxBuffer to 10MB
await pdf2html.html('sample.pdf', { maxBuffer: 1024 * 10000 });
await pdf2html.text('sample.pdf', { maxBuffer: 1024 * 10000 });
await pdf2html.pages('sample.pdf', { maxBuffer: 1024 * 10000 });
await pdf2html.thumbnail('sample.pdf', { maxBuffer: 1024 * 10000 });
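Because maxBuffer is expressed in bytes, the size arithmetic is easy to get slightly off. The sketch below compares the value used above with an exact mebibyte count. `mebibytes` is a hypothetical helper, not part of pdf2html.

```javascript
// Hypothetical helper: convert a size in mebibytes (MiB) to a byte count.
function mebibytes(n) {
    return n * 1024 * 1024;
}

const roughlyTenMB = 1024 * 10000; // 10,240,000 bytes, the value used above
const exactTenMiB = mebibytes(10); // 10,485,760 bytes

const bufferOptions = { maxBuffer: exactTenMiB };
// e.g. await pdf2html.text('sample.pdf', bufferOptions);
```

Either value works as a limit of about 10MB; the helper only makes the intended size explicit.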
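// Generate a thumbnail with the default settings
const thumbnailPath = await pdf2html.thumbnail('sample.pdf');
console.log(thumbnailPath);
// Generate a thumbnail with a custom page, image type and size
const thumbnailOptions = { page: 1, imageType: 'png', width: 160, height: 226 };
const customThumbnailPath = await pdf2html.thumbnail('sample.pdf', thumbnailOptions);
console.log(customThumbnailPath);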
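The 160 by 226 size in the example is close to the 1:√2 ratio of a portrait A-series page such as A4. Assuming that ratio is what you want (the source does not say so), a height can be derived from a chosen width. `portraitThumbnailOptions` is a hypothetical helper built only from the option names shown above.

```javascript
// Hypothetical helper: build thumbnail options for a given width, assuming a
// portrait page with a 1:sqrt(2) aspect ratio (as for A4). Other page sizes
// would need a different ratio.
function portraitThumbnailOptions(width, page = 1, imageType = 'png') {
    const height = Math.round(width * Math.SQRT2);
    return { page, imageType, width, height };
}

console.log(portraitThumbnailOptions(160));
// e.g. await pdf2html.thumbnail('sample.pdf', portraitThumbnailOptions(320));
```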
In some environments, for example behind an HTTP proxy, downloading the dependencies may be too slow or may fail entirely. Follow the steps below to download them manually instead of relying on the automatic download.
cd node_modules/pdf2html/vendor
# These URLs come from https://github.com/shebinleo/pdf2html/blob/master/postinstall.js#L6-L7
wget https://archive.apache.org/dist/pdfbox/2.0.27/pdfbox-app-2.0.27.jar
wget https://archive.apache.org/dist/tika/2.6.0/tika-app-2.6.0.jar
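After downloading the files manually, a quick check that both JARs landed in the vendor directory can save a confusing failure later. This is a minimal sketch: the file names are taken from the wget commands above, `missingJars` is a hypothetical helper, and the script assumes it is run from the project root.

```javascript
const fs = require('fs');
const path = require('path');

// File names taken from the wget commands above.
const REQUIRED_JARS = ['pdfbox-app-2.0.27.jar', 'tika-app-2.6.0.jar'];

// Hypothetical helper: returns the names of the JAR files missing from `vendorDir`.
function missingJars(vendorDir) {
    return REQUIRED_JARS.filter((name) => !fs.existsSync(path.join(vendorDir, name)));
}

// Assumes the current working directory is the project root.
const vendorDir = path.join('node_modules', 'pdf2html', 'vendor');
const missing = missingJars(vendorDir);
console.log(missing.length === 0 ? 'All JAR files are present.' : `Missing: ${missing.join(', ')}`);
```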