pdf-text-reader

Dead simple pdf text reader

  • Version: 5.1.0 (latest)
  • Weekly downloads: 9.3K (increased by 19.92%)
  • Maintainers: 1

PDF Text Reader

Dead simple PDF text reader for Node.js. Uses Mozilla's pdfjs-dist package.

Requires ESM and Node.js v22 or greater. (These requirements come from Mozilla's pdfjs-dist package itself.)

Install

npm install pdf-text-reader

Usage

  • Read all pages into a single string with readPdfText:

    import {readPdfText} from 'pdf-text-reader';
    
    async function main() {
        const pdfText: string = await readPdfText({url: 'path/to/pdf/file.pdf'});
        console.info(pdfText);
    }
    
    main();
    
  • Read a PDF into individual pages with readPdfPages:

    import {readPdfPages} from 'pdf-text-reader';
    
    async function main() {
        const pages = await readPdfPages({url: 'path/to/pdf/file.pdf'});
        console.info(pages[0]?.lines);
    }
    
    main();
    

See the types for detailed argument and return value types.
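
As a small illustration of working with the per-page output, the sketch below searches pages for a piece of text. It assumes each page exposes a `lines` array of strings, as the `pages[0]?.lines` access above suggests. The `PageWithLines` type and the `findPagesContaining` helper are hypothetical names for this example and are not part of the package.

```typescript
// Hypothetical minimal shape: only the `lines` field used in the example above.
// The assumption that `lines` holds strings should be checked against the package's types.
type PageWithLines = {lines: string[]};

// Returns the 1-indexed numbers of the pages that contain `needle` in any line.
function findPagesContaining(pages: PageWithLines[], needle: string): number[] {
    const matches: number[] = [];
    pages.forEach((page, index) => {
        if (page.lines.some(line => line.includes(needle))) {
            matches.push(index + 1);
        }
    });
    return matches;
}
```

With real data, the `pages` argument would be the value returned by `readPdfPages`.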

Details

This package simply reads the output of pdfjs.getDocument and sorts it into lines based on text position in the document. It also inserts spaces between pieces of text on the same line that are far apart horizontally, and newlines between lines that are far apart vertically.
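
The general idea of sorting positioned text into lines can be sketched as follows. This is an illustration only and not the package's actual implementation: the `PositionedText` shape, the `groupIntoLines` name, the half-text-height tolerance, and the single-space join are all assumptions made for this example.

```typescript
// Hypothetical item shape: only the fields this sketch needs.
type PositionedText = {str: string; x: number; y: number; height: number};

// Groups items into lines by vertical position, then orders each line left to right.
function groupIntoLines(items: PositionedText[]): string[] {
    // PDF y coordinates grow upward, so a larger y is higher on the page.
    const sorted = [...items].sort((a, b) => b.y - a.y || a.x - b.x);
    const lines: {y: number; items: PositionedText[]}[] = [];
    for (const item of sorted) {
        const current = lines[lines.length - 1];
        // Assumed tolerance: items within half a text height share a line.
        if (current && Math.abs(current.y - item.y) < item.height / 2) {
            current.items.push(item);
        } else {
            lines.push({y: item.y, items: [item]});
        }
    }
    // Simplification: join with a single space instead of distance-based spacing.
    return lines.map(line =>
        [...line.items].sort((a, b) => a.x - b.x).map(item => item.str).join(' '),
    );
}
```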

Example:

The text below in a PDF will be read with spaces between the cells, even if the PDF itself contains no space characters.

cell 1               cell 2                 cell 3

The number of spaces to insert is calculated by an extremely naive but very simple calculation of Math.ceil(distance-between-text/text-height).
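
Written out as code, that calculation might look like the sketch below. The `spacesBetween` and `joinWithSpacing` names are hypothetical, and how the package measures the distance and text height internally is not shown here.

```typescript
// Number of spaces for a horizontal gap, following the formula quoted above:
// Math.ceil(distance-between-text / text-height).
function spacesBetween(distanceBetweenText: number, textHeight: number): number {
    return Math.ceil(distanceBetweenText / textHeight);
}

// Hypothetical helper: joins two pieces of text with the computed number of spaces.
function joinWithSpacing(left: string, right: string, gap: number, textHeight: number): string {
    return left + ' '.repeat(spacesBetween(gap, textHeight)) + right;
}
```

For example, a gap of 25 units between text that is 10 units tall gives `Math.ceil(2.5)`, so 3 spaces are inserted.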

Low Level Control

If you need lower-level parsing control, you can also use the exported parsePageItems function. It parses only one page at a time, as seen below. readPdfPages uses this function internally, so the output will be identical for the same PDF page.

You may need to independently install the pdfjs-dist npm package for this to work.

import * as pdfjs from 'pdfjs-dist';
import type {TextItem} from 'pdfjs-dist/types/src/display/api';
import {parsePageItems} from 'pdf-text-reader';

async function main() {
    const doc = await pdfjs.getDocument('myDocument.pdf').promise;
    const page = await doc.getPage(1); // 1-indexed
    const content = await page.getTextContent();
    const items: TextItem[] = content.items.filter((item): item is TextItem => 'str' in item);
    const parsedPage = parsePageItems(items);
    console.info(parsedPage.lines);
}

main();
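
To cover every page instead of just the first, the same calls can be wrapped in a loop over `numPages`. The sketch below is one way to do that, under stated assumptions: the `DocumentLike`, `PageLike`, and `TextItemLike` types are hypothetical stand-ins for the pdfjs-dist types, included so the example is self-contained, and `readAllPageItems` is not part of either package.

```typescript
// Hypothetical structural types covering only the members this sketch uses.
type TextItemLike = {str: string};
type PageLike = {getTextContent(): Promise<{items: unknown[]}>};
type DocumentLike = {numPages: number; getPage(pageNumber: number): Promise<PageLike>};

// Keeps only the items that carry a `str` property, as in the filter above.
function isTextItem(item: unknown): item is TextItemLike {
    return typeof item === 'object' && item !== null && 'str' in item;
}

// Collects the text items of every page, one array per page.
async function readAllPageItems(doc: DocumentLike): Promise<TextItemLike[][]> {
    const allPages: TextItemLike[][] = [];
    // Pages are 1-indexed, so the loop runs from 1 through numPages inclusive.
    for (let pageNumber = 1; pageNumber <= doc.numPages; pageNumber++) {
        const page = await doc.getPage(pageNumber);
        const content = await page.getTextContent();
        allPages.push(content.items.filter(isTextItem));
    }
    return allPages;
}
```

In real use, each page's items would be typed as `TextItem` from pdfjs-dist and passed to `parsePageItems`, as in the single-page example above.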

Package last updated on 08 May 2024
