
doc-extract
A powerful Node.js library for reading and extracting text from various document formats including PDF, DOCX, DOC, PPT, PPTX, and TXT files.
npm install doc-extract
This library depends on some system packages for full functionality, covering PDF support as well as PowerPoint and DOC support:
# Ubuntu/Debian
sudo apt-get install antiword unrtf poppler-utils tesseract-ocr
# macOS
brew install antiword unrtf poppler tesseract
# Windows
# Install poppler and tesseract manually or use chocolatey:
choco install poppler tesseract
import DocumentReader, { readDocument } from "doc-extract";
// Simple usage
const content = await readDocument("./path/to/document.pdf");
console.log(content.text);
console.log(content.metadata);
// Using the class for more control
const reader = new DocumentReader({ debug: true });
const docxContent = await reader.readDocument("./path/to/document.docx");
new DocumentReader(options?: { debug?: boolean })
options.debug: Enable debug logging (default: false)
Read a document from file path.
const reader = new DocumentReader();
const content = await reader.readDocument("./document.pdf");
Read a document from a Buffer.
const fs = require("fs");
const buffer = fs.readFileSync("./document.pdf");
const content = await reader.readDocumentFromBuffer(buffer, "document.pdf");
Read multiple documents at once.
const contents = await reader.readMultipleDocuments([
"./doc1.pdf",
"./doc2.docx",
"./doc3.pptx",
]);
Read multiple documents from buffers.
const contents = await reader.readMultipleFromBuffers([
{ buffer: buffer1, fileName: "doc1.pdf" },
{ buffer: buffer2, fileName: "doc2.docx" },
]);
// PDF specific
const pdfContent = await reader.readPdf("./document.pdf");
// DOCX specific (includes HTML conversion)
const docxContent = await reader.readDocx("./document.docx");
console.log(docxContent.html); // HTML version of the document
// PowerPoint specific
const pptContent = await reader.readPowerPoint("./presentation.pptx");
// Check if format is supported
const isSupported = reader.isFormatSupported("./document.pdf"); // true
// Get all supported formats
const formats = reader.getSupportedFormats(); // ['pdf', 'docx', 'doc', 'pptx', 'ppt', 'txt']
// Validate file
await reader.validateFile("./document.pdf"); // throws error if invalid
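To illustrate what an extension-based format check involves, here is a minimal sketch. It is not the library's implementation: `SUPPORTED_EXTENSIONS`, `extensionOf`, and `looksSupported` are hypothetical names, and the extension list is copied from the `getSupportedFormats()` comment above.

```typescript
// Hypothetical helpers, not part of doc-extract.
// The list mirrors the formats shown for getSupportedFormats() above.
const SUPPORTED_EXTENSIONS: readonly string[] = [
  "pdf", "docx", "doc", "pptx", "ppt", "txt",
];

// Returns the lowercased extension of a file name or path, or "" if there is none.
function extensionOf(fileName: string): string {
  // Keep only the last path segment, accepting both "/" and "\" separators.
  const segments = fileName.split(/[\\/]/);
  const base = segments[segments.length - 1] ?? "";
  const dot = base.lastIndexOf(".");
  // No dot, or only a leading dot (a dotfile such as ".env"), means no extension.
  return dot > 0 ? base.slice(dot + 1).toLowerCase() : "";
}

// True when the extension is one of the supported formats.
function looksSupported(fileName: string): boolean {
  return SUPPORTED_EXTENSIONS.indexOf(extensionOf(fileName)) !== -1;
}
```

A check like this only inspects the name; it says nothing about whether the file contents are valid, which is what `validateFile` is for.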
import { readDocument, readDocumentFromBuffer } from "doc-extract";
// Quick read from file
const content = await readDocument("./document.pdf");
// Quick read from buffer
const bufferContent = await readDocumentFromBuffer(buffer, "document.pdf");
interface DocumentContent {
text: string;
metadata?: {
pages?: number;
words?: number;
characters?: number;
fileSize?: number;
fileName?: string;
};
}
interface PdfContent extends DocumentContent {
metadata: DocumentContent["metadata"] & {
pages: number;
info?: any; // PDF metadata from pdf-parse
};
}
interface DocxContent extends DocumentContent {
html?: string; // HTML version of the document
messages?: any[]; // Conversion messages from mammoth
}
enum SupportedFormats {
PDF = "pdf",
DOCX = "docx",
DOC = "doc",
PPTX = "pptx",
PPT = "ppt",
TXT = "txt",
}
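Because `metadata` and all of its fields are optional in `DocumentContent`, consuming code has to handle missing values. The sketch below shows one way to do that; `describeContent` is a hypothetical helper, and the interface is redeclared locally only to keep the example self-contained.

```typescript
// Local copy of the documented shape so the sketch compiles on its own.
interface DocumentContent {
  text: string;
  metadata?: {
    pages?: number;
    words?: number;
    characters?: number;
    fileSize?: number;
    fileName?: string;
  };
}

// Hypothetical helper: builds a one-line summary that tolerates missing metadata.
function describeContent(content: DocumentContent): string {
  const name = content.metadata?.fileName ?? "(unnamed)";
  // Fall back to counting whitespace-separated tokens when `words` is absent.
  const words =
    content.metadata?.words ??
    content.text.split(/\s+/).filter(Boolean).length;
  const pages = content.metadata?.pages;
  const pagePart = pages !== undefined ? `, ${pages} page(s)` : "";
  return `${name}: ${words} word(s)${pagePart}`;
}
```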
The library uses custom error types for better error handling:
import { DocumentReaderError } from "doc-extract";
try {
const content = await readDocument("./nonexistent.pdf");
} catch (error) {
if (error instanceof DocumentReaderError) {
console.log("Error code:", error.code);
console.log("Error message:", error.message);
}
}
UNSUPPORTED_FORMAT: File format not supported
READ_ERROR: General read error
PDF_READ_ERROR: PDF-specific read error
DOCX_READ_ERROR: DOCX-specific read error
TEXTRACT_READ_ERROR: Textract-related error
BUFFER_READ_ERROR: Buffer reading error
VALIDATION_ERROR: File validation error
INVALID_FILE_PATH: Invalid file path
import express from "express";
import multer from "multer";
import { DocumentReader } from "doc-extract";
const app = express();
const upload = multer();
const reader = new DocumentReader();
app.post("/upload", upload.single("document"), async (req, res) => {
try {
if (!req.file) {
return res.status(400).json({ error: "No file uploaded" });
}
const content = await reader.readDocumentFromBuffer(
req.file.buffer,
req.file.originalname,
req.file.mimetype
);
res.json({
text: content.text,
metadata: content.metadata,
});
} catch (error) {
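// `error` is `unknown` under strict TypeScript, so narrow it before reading `.message`
res.status(500).json({ error: error instanceof Error ? error.message : String(error) });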
}
});
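The handler above answers every failure with status 500. One possible refinement is to map the error codes listed earlier to more specific HTTP statuses. The mapping below is an assumption, not something the library prescribes, and `statusForErrorCode` and `codeOf` are hypothetical helpers that read `code` structurally rather than importing the library's error class.

```typescript
// Hypothetical mapping from doc-extract error codes to HTTP statuses.
// The codes come from the list above; the chosen statuses are an assumption.
function statusForErrorCode(code: string | undefined): number {
  switch (code) {
    case "UNSUPPORTED_FORMAT":
      return 415; // Unsupported Media Type
    case "VALIDATION_ERROR":
    case "INVALID_FILE_PATH":
      return 400; // Bad Request
    case "PDF_READ_ERROR":
    case "DOCX_READ_ERROR":
    case "TEXTRACT_READ_ERROR":
    case "BUFFER_READ_ERROR":
      return 422; // upload was received but could not be parsed
    default:
      return 500; // READ_ERROR and anything unrecognized
  }
}

// Reads a string `code` property from an unknown thrown value, if present.
function codeOf(error: unknown): string | undefined {
  if (typeof error === "object" && error !== null && "code" in error) {
    const code = (error as { code?: unknown }).code;
    return typeof code === "string" ? code : undefined;
  }
  return undefined;
}
```

Inside the `catch` block, `res.status(statusForErrorCode(codeOf(error)))` would then replace the fixed `res.status(500)`.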
import { DocumentReader } from "doc-extract";
import { promises as fs } from "fs";
import path from "path";
async function processDocumentsInDirectory(dirPath: string) {
const reader = new DocumentReader({ debug: true });
const files = await fs.readdir(dirPath);
const documentPaths = files
.filter((file) => reader.isFormatSupported(file))
.map((file) => path.join(dirPath, file));
const results = await reader.readMultipleDocuments(documentPaths);
results.forEach((content, index) => {
console.log(`Document ${documentPaths[index]}:`);
console.log(`Words: ${content.metadata?.words}`);
console.log(`Characters: ${content.metadata?.characters}`);
console.log("---");
});
}
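For very large directories, it may be preferable to pass paths to `readMultipleDocuments` in fixed-size batches rather than all at once. How the library schedules reads internally is not documented here, so treat this as a sketch under that assumption; `chunk` is a hypothetical helper, not part of doc-extract.

```typescript
// Hypothetical helper: splits a list into consecutive batches of at most `size` items.
function chunk<T>(items: readonly T[], size: number): T[][] {
  if (!Number.isInteger(size) || size < 1) {
    throw new RangeError("size must be a positive integer");
  }
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    // slice() clamps the end index, so the last batch may be shorter.
    batches.push(items.slice(i, i + size));
  }
  return batches;
}
```

In the function above, looping over `chunk(documentPaths, 10)` and awaiting `reader.readMultipleDocuments(batch)` for each batch would bound how many documents are in flight at a time.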
import { DocumentReader } from "doc-extract";
async function searchInDocument(filePath: string, searchTerm: string) {
const reader = new DocumentReader();
const content = await reader.readDocument(filePath);
const lines = content.text.split("\n");
const matchingLines = lines
.map((line, index) => ({ line, lineNumber: index + 1 }))
.filter(({ line }) =>
line.toLowerCase().includes(searchTerm.toLowerCase())
);
return {
totalMatches: matchingLines.length,
matches: matchingLines,
metadata: content.metadata,
};
}
readMultipleDocuments() is more efficient than individual calls
Contributions are welcome! Please feel free to submit a Pull Request. For major changes, please open an issue first to discuss what you would like to change.
git clone https://github.com/HaiderNakara/doc-extract.git
cd doc-extract
npm install
npm run build
npm test
npm test # Run tests once
npm run test:watch # Run tests in watch mode
npm run test:coverage # Run tests with coverage
This project is licensed under the MIT License - see the LICENSE file for details.