
pdf-parse-new
Pure JavaScript cross-platform module to extract text from PDFs with intelligent performance optimization.
Version 2.0.0 - Release with SmartPDFParser, multi-core processing, and AI-powered method selection based on 15,000+ real-world benchmarks.
SmartPDFParser with AI-Powered Selection
Multi-Core Performance
Battle-Tested Intelligence
Multiple Parsing Strategies
Developer Experience
npm install pdf-parse-new
SmartPDFParser - Intelligent automatic method selection
Multi-Core Processing
Performance Improvements
Better DX
Version 2.0.0 is backward compatible. Your existing code will continue to work:
// v1.x code still works
const pdf = require('pdf-parse-new');
pdf(buffer).then(data => console.log(data.text));
To take advantage of new features:
// Use SmartPDFParser for automatic optimization
const SmartParser = require('pdf-parse-new/lib/SmartPDFParser');
const parser = new SmartParser();
const result = await parser.parse(buffer);
console.log(`Used ${result._meta.method} in ${result._meta.duration}ms`);
const fs = require('fs');
const pdf = require('pdf-parse-new');
const dataBuffer = fs.readFileSync('path/to/file.pdf');
pdf(dataBuffer).then(function(data) {
  console.log(data.numpages); // Number of pages
  console.log(data.text); // Full text content
  console.log(data.info); // PDF metadata
});
See test/examples/ for practical examples:
# Try the examples
npm run example:basic # Basic parsing
npm run example:smart # SmartPDFParser (recommended)
npm run example:compare # Compare all methods
# Or run directly
node test/examples/01-basic-parse.js
node test/examples/06-smart-parser.js
7 complete examples covering all parsing methods with real-world patterns!
const fs = require('fs');
const SmartParser = require('pdf-parse-new/lib/SmartPDFParser');
const parser = new SmartParser();
const dataBuffer = fs.readFileSync('large-document.pdf');
parser.parse(dataBuffer).then(function(result) {
  console.log(`Parsed ${result.numpages} pages in ${result._meta.duration}ms`);
  console.log(`Method used: ${result._meta.method}`);
  console.log(result.text);
});
pdf(dataBuffer)
  .then(data => {
    // Process data
  })
  .catch(error => {
    console.error('Error parsing PDF:', error);
  });
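The same error handling can be written with async/await. The sketch below is a minimal wrapper, assuming only that the parse function returns a promise; `safeParse` is a hypothetical helper name, not part of the library.

```javascript
// Wraps any promise-returning parse function so callers receive a result
// object instead of a rejected promise. `parse` is injected by the caller,
// for example the function exported by require('pdf-parse-new').
async function safeParse(parse, buffer) {
  try {
    const data = await parse(buffer);
    return { ok: true, data };
  } catch (error) {
    return { ok: false, error };
  }
}
```

A caller would then check `result.ok` before reading `result.data.text`, which keeps the failure path explicit at each call site.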
The SmartPDFParser automatically selects the optimal parsing method based on PDF characteristics.
Based on 9,417 real-world benchmarks (trained 2025-11-23):
| Pages | Method | Avg Time | Best For |
|---|---|---|---|
| 1-10 | batch-5 | ~10ms | Tiny documents |
| 11-50 | batch-10 | ~107ms | Small documents |
| 51-200 | batch-20 | ~332ms | Medium documents |
| 201-500 | batch-50 | ~1102ms | Large documents |
| 501-1000 | batch-50 | ~1988ms | X-Large documents |
| 1000+ | processes* | ~2355-4468ms | Huge documents (2-4x faster!) |
*Both workers and processes are excellent for huge PDFs. Processes is the default due to better consistency, but workers can be faster in some cases. Use forceMethod: 'workers' to try workers.
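The table above can be read as a lookup from page count to method. The sketch below encodes the published ranges in a hypothetical helper, `pickMethodByPages`. It is an illustration of the table only, not the library's decision tree, which also takes CPU count and file size into account.

```javascript
// Illustrative only: mirrors the benchmark table, not SmartPDFParser internals.
const PAGE_RULES = [
  { maxPages: 10, method: 'batch-5' },
  { maxPages: 50, method: 'batch-10' },
  { maxPages: 200, method: 'batch-20' },
  { maxPages: 1000, method: 'batch-50' },
];

function pickMethodByPages(pages) {
  // First rule whose upper bound covers the page count wins.
  const rule = PAGE_RULES.find(r => pages <= r.maxPages);
  // Anything above 1000 pages falls through to the multi-process method.
  return rule ? rule.method : 'processes';
}
```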
const SmartParser = require('pdf-parse-new/lib/SmartPDFParser');
const parser = new SmartParser();
// Automatically selects best method
const result = await parser.parse(pdfBuffer);
const parser = new SmartParser({
forceMethod: 'workers' // 'batch', 'workers', 'processes', 'stream', 'sequential'
});
// Example: Compare workers vs processes for your specific PDFs
const testWorkers = new SmartParser({ forceMethod: 'workers' });
const testProcesses = new SmartParser({ forceMethod: 'processes' });
const result1 = await testWorkers.parse(hugePdfBuffer);
console.log(`Workers: ${result1._meta.duration}ms`);
const result2 = await testProcesses.parse(hugePdfBuffer);
console.log(`Processes: ${result2._meta.duration}ms`);
const parser = new SmartParser({
maxMemoryUsage: 2e9 // 2GB max
});
PDF parsing is I/O-bound. During I/O waits, CPU cores sit idle. Oversaturation keeps them busy:
const parser = new SmartParser({
oversaturationFactor: 1.5 // Use 1.5x more workers than cores
});
// Example on 24-core system:
// - Default (1.5x): 36 workers (instead of 23!)
// - Aggressive (2x): 48 workers
// - Conservative (1x): 24 workers
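The worker counts in the comments above follow from multiplying the core count by the factor. A minimal sketch of that arithmetic, assuming the result is rounded to the nearest integer (the rounding rule and the helper name `workersFor` are assumptions, not confirmed library behavior):

```javascript
// Hypothetical helper: worker count for a given core count and factor.
function workersFor(cores, oversaturationFactor = 1.5) {
  // Never return fewer than one worker.
  return Math.max(1, Math.round(cores * oversaturationFactor));
}

console.log(workersFor(24));    // 36
console.log(workersFor(24, 2)); // 48
console.log(workersFor(24, 1)); // 24
```

On a real machine the core count would typically come from `require('os').cpus().length`.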
Why this works:
Automatic memory limiting:
const stats = parser.getStats();
console.log(stats);
// {
// totalParses: 10,
// methodUsage: { batch: 8, workers: 2 },
// averageTimes: { batch: 150.5, workers: 2300.1 },
// failedParses: 0
// }
SmartPDFParser automatically adapts to your CPU:
// pdf500Pages is a Buffer holding a 500-page PDF
// On 4-core laptop
parser.parse(pdf500Pages);
// → Uses workers (threshold: ~167 pages)
// On 48-core server
parser.parse(pdf500Pages);
// → Uses batch (threshold: ~2000 pages, workers overhead not worth it yet)
This ensures optimal performance regardless of hardware! The decision tree was trained on multiple machines with different core counts.
SmartPDFParser uses intelligent fast-paths to minimize overhead:
const parser = new SmartParser();
// Tiny PDF (< 0.5 MB)
await parser.parse(tiny_pdf);
// Fast-path: ~0.5ms overhead (50x faster than tree navigation!)
// Small PDF (< 1 MB)
await parser.parse(small_pdf);
// Fast-path: ~0.5ms overhead
// Medium PDF (already seen similar)
await parser.parse(medium_pdf);
// Cache hit: ~1ms overhead
// Common scenario (500 pages, 5MB)
await parser.parse(common_pdf);
// Common scenario: ~2ms overhead
// Rare case (unusual size/page ratio)
await parser.parse(unusual_pdf);
// Full tree: ~25ms overhead (only for edge cases)
Overhead Comparison:
| PDF Type | Before | After | Speedup |
|---|---|---|---|
| Tiny (< 0.5 MB) | 25ms | 0.5ms | 50x faster |
| Small (< 1 MB) | 25ms | 0.5ms | 50x faster |
| Cached | 25ms | 1ms | 25x faster |
| Common | 25ms | 2ms | 12x faster |
| Rare | 25ms | 25ms | Same |
90%+ of PDFs hit a fast-path! This means minimal overhead even for tiny documents.
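One way to picture the fast-path order is as a chain of cheap checks before the full decision tree. The sketch below is an illustration under assumptions: the field names, the order of the checks, and the use of 1 MB = 1024 * 1024 bytes are hypothetical, while the overhead figures come from the table above.

```javascript
const MB = 1024 * 1024;

// Hypothetical sketch of a fast-path dispatch; not the library's code.
function selectionPath({ sizeBytes, cacheHit = false, commonScenario = false }) {
  // Tiny and small PDFs (both under 1 MB) share the same fast-path cost.
  if (sizeBytes < 1 * MB) return { path: 'fast-path', overheadMs: 0.5 };
  if (cacheHit) return { path: 'cache', overheadMs: 1 };
  if (commonScenario) return { path: 'common', overheadMs: 2 };
  // Everything else pays for a full tree traversal.
  return { path: 'full-tree', overheadMs: 25 };
}
```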
Parse a PDF file and extract text content.
Parameters:
- dataBuffer (Buffer): PDF file buffer
- options (Object, optional):
  - pagerender (Function): Custom page rendering function
  - max (Number): Maximum number of pages to parse
  - version (String): PDF.js version to use

Returns: Promise
- numpages (Number): Total number of pages
- numrender (Number): Number of rendered pages
- info (Object): PDF metadata
- metadata (Object): PDF metadata object
- text (String): Extracted text content
- version (String): PDF.js version used

Options:
- forceMethod (String): Force specific parsing method
- maxMemoryUsage (Number): Maximum memory usage in bytes
- availableCPUs (Number): Override CPU count detection

Parse PDF with automatic method selection.
Returns: Promise (same as pdf() with additional _meta field)
- _meta.method (String): Parsing method used
- _meta.duration (Number): Parse time in milliseconds
- _meta.analysis (Object): PDF analysis data

Get parsing statistics for current session.
This library includes full TypeScript definitions and works seamlessly with NestJS.
// ✅ CORRECT: Use namespace import
import * as PdfParse from 'pdf-parse-new';
// Create parser instance
const parser = new PdfParse.SmartPDFParser({
oversaturationFactor: 2.0,
enableFastPath: true
});
// Parse PDF
const result = await parser.parse(pdfBuffer);
console.log(`Parsed ${result.numpages} pages using ${result._meta.method}`);
// ❌ WRONG: This will NOT work
import PdfParse from 'pdf-parse-new'; // Error: SmartPDFParser is not a constructor
import { SmartPDFParser } from 'pdf-parse-new'; // Error: No named export
import { Injectable } from '@nestjs/common';
import * as PdfParse from 'pdf-parse-new';
import * as fs from 'fs';

@Injectable()
export class PdfService {
  private parser: PdfParse.SmartPDFParser;

  constructor() {
    // Initialize parser with custom options
    this.parser = new PdfParse.SmartPDFParser({
      oversaturationFactor: 2.0,
      enableFastPath: true,
      enableCache: true,
      maxWorkerLimit: 50
    });
  }

  async parsePdf(filePath: string): Promise<string> {
    const dataBuffer = fs.readFileSync(filePath);
    const result = await this.parser.parse(dataBuffer);
    console.log(`Pages: ${result.numpages}`);
    console.log(`Method: ${result._meta?.method}`);
    console.log(`Duration: ${result._meta?.duration?.toFixed(2)}ms`);
    return result.text;
  }

  getParserStats() {
    return this.parser.getStats();
  }
}
import { Controller, Post, UploadedFile, UseInterceptors } from '@nestjs/common';
import { FileInterceptor } from '@nestjs/platform-express';
import * as PdfParse from 'pdf-parse-new';

@Controller('pdf')
export class PdfController {
  private parser = new PdfParse.SmartPDFParser({ oversaturationFactor: 2.0 });

  @Post('upload')
  @UseInterceptors(FileInterceptor('file'))
  async uploadPdf(@UploadedFile() file: Express.Multer.File) {
    const result = await this.parser.parse(file.buffer);
    return {
      pages: result.numpages,
      text: result.text,
      metadata: result.info,
      parsingInfo: {
        method: result._meta?.method,
        duration: result._meta?.duration,
        fastPath: result._meta?.fastPath || false
      }
    };
  }
}
// Method 1: Namespace import (recommended)
import * as PdfParse from 'pdf-parse-new';
const parser = new PdfParse.SmartPDFParser();
// Method 2: CommonJS require
const PdfParse = require('pdf-parse-new');
const parser = new PdfParse.SmartPDFParser();
// Method 3: Direct module import
import SmartPDFParser = require('pdf-parse-new/lib/SmartPDFParser');
const parser = new SmartPDFParser();
All types are fully documented and available:
import * as PdfParse from 'pdf-parse-new';
// Use types from the namespace
type Result = PdfParse.Result;
type Options = PdfParse.Options;
type SmartParserOptions = PdfParse.SmartParserOptions;
For more detailed examples and troubleshooting, see NESTJS_USAGE.md
For a 1500-page PDF:
| Method | Time (estimate) | Speed vs Batch | Notes |
|---|---|---|---|
| Workers | ~2.4-7s | 2-7x faster | Faster startup, can vary by PDF |
| Processes | ~4.2-4.5s | 3-4x faster | More consistent, better isolation |
| Batch | ~17.6s | baseline | Good up to 1000 pages |
| Sequential | ~17.8s | 0.99x | Fallback only |
Note: Performance varies by PDF complexity, size, and system. Both workers and processes provide significant speedup - test both on your specific PDFs to find the best option.
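To test both methods on your own files, a small timing harness is enough. The sketch below assumes only that each candidate exposes an async `parse(buffer)` method, as SmartPDFParser does in the examples above; `compareMethods` is a hypothetical helper, and a single run per candidate is a rough measurement rather than a proper benchmark.

```javascript
const { performance } = require('perf_hooks');

// Times each candidate's parse() on the same buffer, one after another,
// and returns the timings sorted fastest first.
// `candidates` maps a label to any object exposing an async parse(buffer).
async function compareMethods(candidates, buffer) {
  const timings = [];
  for (const [label, parser] of Object.entries(candidates)) {
    const start = performance.now();
    await parser.parse(buffer);
    timings.push({ label, ms: performance.now() - start });
  }
  return timings.sort((a, b) => a.ms - b.ms);
}
```

For example, passing `{ workers: new SmartParser({ forceMethod: 'workers' }), processes: new SmartParser({ forceMethod: 'processes' }) }` together with a PDF buffer would report which of the two is faster for that file.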
forceMethod: 'workers' to test
Batch (default for most cases)
Workers (best for huge PDFs)
maxWorkers to 2-4 to avoid memory issues
Processes (alternative to workers)
Stream (memory constrained)
Sequential (fallback)
The library includes comprehensive benchmarking tools for optimization.
benchmark/
├── collect-benchmarks.js # Collect performance data
├── train-smart-parser.js # Train decision tree
├── test-pdfs.example.json # Example PDF list
└── test-pdfs.json # Your PDFs (gitignored)
cp benchmark/test-pdfs.example.json benchmark/test-pdfs.json
# Edit test-pdfs.json with your PDF URLs/paths
node benchmark/collect-benchmarks.js
Features:
node benchmark/train-smart-parser.js
Analyzes collected benchmarks and generates optimized parsing rules.
{
"note": "Add your PDF URLs or file paths here",
"urls": [
"./test/data/sample.pdf",
"https://example.com/document.pdf",
"/absolute/path/to/file.pdf"
]
}
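Because the `urls` array mixes remote URLs with relative and absolute file paths, a consumer of this file has to tell them apart. The sketch below shows one way to do that, assuming the config shape shown above; `classifyEntries` is a hypothetical helper, and how `collect-benchmarks.js` actually handles each entry is not documented here.

```javascript
const path = require('path');

// Splits the "urls" array into remote URLs and local file paths.
// Relative paths are resolved against baseDir (defaults to the working directory).
function classifyEntries(config, baseDir = process.cwd()) {
  const remote = [];
  const local = [];
  for (const entry of config.urls || []) {
    if (/^https?:\/\//i.test(entry)) {
      remote.push(entry);
    } else {
      local.push(path.resolve(baseDir, entry));
    }
  }
  return { remote, local };
}
```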
Out of Memory
// Limit memory usage
const parser = new SmartParser({ maxMemoryUsage: 2e9 });
// Or use streaming
const parser = new SmartParser({ forceMethod: 'stream' });
Slow Parsing
// For large PDFs, force workers
const parser = new SmartParser({ forceMethod: 'workers' });
Corrupted/Invalid PDFs
// More aggressive parsing
const pdf = require('pdf-parse-new/lib/pdf-parse-aggressive');
pdf(dataBuffer).then(data => console.log(data.text));
// Enable verbose logging
process.env.DEBUG = 'pdf-parse:*';
This library is designed to work correctly when installed as an npm module.
All internal paths use proper resolution:
- path.join(__dirname, 'pdf-worker.js')
- path.join(__dirname, 'pdf-child.js')
- require('./pdf.js/v4.5.136/build/pdf.js')

This ensures the library works correctly:
- npm install
- node_modules/ directory

The library automatically resolves all internal paths - you don't need to configure anything!
function customPageRenderer(pageData) {
  const renderOptions = {
    normalizeWhitespace: true,
    disableCombineTextItems: false
  };
  return pageData.getTextContent(renderOptions).then(textContent => {
    let text = '';
    for (const item of textContent.items) {
      text += item.str + ' ';
    }
    return text;
  });
}
const options = { pagerender: customPageRenderer };
pdf(dataBuffer, options).then(data => console.log(data.text));
// Parse only first 10 pages
pdf(dataBuffer, { max: 10 }).then(data => {
console.log(`Parsed ${data.numrender} of ${data.numpages} pages`);
});
const PDFProcess = require('pdf-parse-new/lib/pdf-parse-processes');
PDFProcess(dataBuffer, {
maxProcesses: 4, // Use 4 parallel processes
batchSize: 10 // Process 10 pages per batch
}).then(data => console.log(data.text));
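To make the `batchSize` option concrete, the sketch below shows how a page count divides into batches of a fixed size. `pageBatches` is a hypothetical helper written for illustration; it is not taken from the library, which may distribute pages differently.

```javascript
// Splits 1-based page numbers into inclusive [start, end] ranges.
function pageBatches(numPages, batchSize) {
  const batches = [];
  for (let start = 1; start <= numPages; start += batchSize) {
    batches.push([start, Math.min(start + batchSize - 1, numPages)]);
  }
  return batches;
}

console.log(pageBatches(25, 10)); // [ [ 1, 10 ], [ 11, 20 ], [ 21, 25 ] ]
```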
| Feature | pdf-parse | pdf-parse-new 2.0 |
|---|---|---|
| Speed (huge PDFs) | Baseline | 2-4x faster |
| Smart optimization | ❌ | ✅ AI-powered |
| Multi-core support | ❌ | ✅ Workers + Processes |
| CPU adaptation | ❌ | ✅ 4-48+ cores |
| Fast-path | ❌ | ✅ 50x faster overhead |
| Caching | ❌ | ✅ LRU cache |
| TypeScript | Partial | ✅ Complete |
| Examples | Basic | ✅ 7 production-ready |
| Benchmarking | ❌ | ✅ Tools included |
| Maintenance | Slow | ✅ Active |
9,924-page PDF (13.77 MB) on 24-core system:
Sequential: ~15,000ms
Batch-50: ~11,723ms
Processes: ~4,468ms ✅ (2.6x faster than batch)
Workers: ~6,963ms ✅ (1.7x faster than batch)
SmartParser: Automatically chooses Processes
100 KB PDF on any system:
Overhead:
- Without fast-path: 25ms
- With fast-path: 0.5ms ✅ (50x faster)
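The speedup figures above are the baseline time divided by the candidate time. A short sketch that reproduces them from the quoted timings (`speedup` is a hypothetical helper name):

```javascript
// Ratio of baseline to candidate time, formatted to one decimal place.
function speedup(baselineMs, candidateMs) {
  return `${(baselineMs / candidateMs).toFixed(1)}x`;
}

console.log(speedup(11723, 4468)); // 2.6x  (batch-50 vs processes)
console.log(speedup(11723, 6963)); // 1.7x  (batch-50 vs workers)
console.log(speedup(25, 0.5));     // 50.0x (full tree vs fast-path overhead)
```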
Contributions are welcome! Please read our contributing guidelines.
git clone https://github.com/your-repo/pdf-parse-new.git
cd pdf-parse-new
npm install
npm test
npm test # Run all tests
npm run test:smart # Test smart parser
npm run benchmark # Run benchmarks
MIT License - see LICENSE file for details.
Major Features:
Performance:
Breaking Changes:
See CHANGELOG for complete version history.
Made with ❤️ for the JavaScript community
npm: pdf-parse-new
Repository: GitHub
Issues: Report bugs