
Fast PDF text extraction with paragraph layout and bounding regions.
pnpm add sigocr
import { sigocr } from "sigocr";
// Extract structured text from a PDF (returns null if scanned/no embedded text)
const doc = await sigocr.pdf("/path/to/file.pdf");
// -> Document | null
// Extract from a buffer
const docFromBuffer = await sigocr.buffer(pdfBuffer);
// Check if a PDF has embedded text (fast, doesn't extract)
const has = await sigocr.hasText("/path/to/file.pdf");
// -> boolean
Extract many files in a single native call. Each file uses internal page-chunk parallelism via Rayon.
const docs = await sigocr.files([
  "/uploads/contract.pdf",
  "/uploads/invoice.pdf",
  "/uploads/report.pdf",
]);
// -> (Document | null)[]
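Because a batch can mix extracted documents with `null` entries, it may help to pair each result with its input path. The following is a minimal sketch, not part of sigocr: `partitionBatch` is a hypothetical helper, and it assumes the results array is in the same order as the input paths, which the text above does not state and should be verified. The sample results are stand-ins rather than real output.

```typescript
// Pairs each input path with its batch result and separates extracted
// documents from files that returned null (scanned / no embedded text).
// ASSUMPTION: results[i] corresponds to paths[i]; the README does not state this.
function partitionBatch<T>(
  paths: string[],
  results: (T | null)[],
): { extracted: { path: string; doc: T }[]; needsOcr: string[] } {
  if (paths.length !== results.length) {
    throw new Error("paths and results must have the same length");
  }
  const extracted: { path: string; doc: T }[] = [];
  const needsOcr: string[] = [];
  paths.forEach((path, i) => {
    const doc = results[i] ?? null;
    if (doc === null) {
      needsOcr.push(path); // no embedded text; route to an OCR step elsewhere
    } else {
      extracted.push({ path, doc });
    }
  });
  return { extracted, needsOcr };
}

// Stand-in results for illustration; real values would come from sigocr.files(...).
const demo = partitionBatch(
  ["/uploads/contract.pdf", "/uploads/invoice.pdf", "/uploads/report.pdf"],
  [{ content: "Contract text" }, null, { content: "Report text" }],
);
```

Keeping the path next to each document avoids index bookkeeping later, and the `needsOcr` list gives a clear hand-off point for files this library cannot read.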
The output shape:
interface Document {
  content: string; // full text in reading order
  pages: Page[];
  paragraphs: Paragraph[];
  tables: Table[];
}
interface Paragraph {
  content: string;
  role?: string; // "title" | "sectionHeading" | "pageHeader" | "pageFooter" | "pageNumber"
  spans: Span[]; // character offset + length into Document.content
  boundingRegions: BoundingRegion[]; // page number + polygon in points
}
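To illustrate how spans relate to the full text, here is a minimal sketch. It re-declares only the fields it needs and assumes each span exposes `offset` and `length` fields; those names are an assumption, since the shape above only says a span holds a character offset and length into `Document.content`. The helpers `textFromSpans` and `paragraphsWithRole` are hypothetical, and the sample document is hand-built, not real sigocr output.

```typescript
// Assumed shape: the README describes a span as "character offset + length",
// but the field names `offset` and `length` are hypothetical.
interface Span {
  offset: number;
  length: number;
}

interface ParagraphLike {
  content: string;
  role?: string;
  spans: Span[];
}

interface DocumentLike {
  content: string;
  paragraphs: ParagraphLike[];
}

// Recover a paragraph's text by slicing the document-level content with its spans.
function textFromSpans(doc: DocumentLike, paragraph: ParagraphLike): string {
  return paragraph.spans
    .map((s) => doc.content.slice(s.offset, s.offset + s.length))
    .join("");
}

// Collect the text of every paragraph tagged with a given role.
function paragraphsWithRole(doc: DocumentLike, role: string): string[] {
  return doc.paragraphs.filter((p) => p.role === role).map((p) => p.content);
}

// Hand-built sample, for illustration only.
const sampleDoc: DocumentLike = {
  content: "Annual Report\nRevenue grew this year.",
  paragraphs: [
    { content: "Annual Report", role: "title", spans: [{ offset: 0, length: 13 }] },
    { content: "Revenue grew this year.", spans: [{ offset: 14, length: 23 }] },
  ],
};
```

Filtering on `role` is one way to pull out titles and section headings, or to drop page headers and footers before indexing the text.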
Measured on Apple M3 Pro, Node.js v24.
Structured extraction:

| Library | Mean | ms/page | vs sigocr |
|---|---|---|---|
| sigocr (native) | 16.5ms | 0.08 | 1x |
| pdf.js-extract (JS) | 71.0ms | 0.36 | 4.3x slower |
| pdf2json (JS) | 280ms | 1.40 | 17x slower |

Plain-text-only extractors:

| Library | Mean | ms/page |
|---|---|---|
| pdf-parse (JS) | 51.6ms | 0.26 |
| unpdf (JS) | 63.9ms | 0.32 |

Batch extraction, 100 PDFs:

| Library | Mean | vs sigocr |
|---|---|---|
| sigocr.files (native batch) | 17.2ms | 1x |
| sigocr.pdf (sequential loop) | 97.3ms | 5.7x slower |
| pdf.js-extract (sequential loop) | 134.0ms | 7.8x slower |

Embedded-text check:

| Check | Mean |
|---|---|
| sigocr hasTextBuffer | 1.7ms |
sigocr is 4.3x faster than pdf.js-extract and 17x faster than pdf2json while producing richer output: grouped paragraphs with roles, bounding regions, and document-level spans. pdf.js-extract returns raw text items that still need assembly. sigocr is also 3.1x faster than the fastest plain-text-only extractor measured (pdf-parse, 51.6ms vs 16.5ms), despite doing strictly more work. Batch extraction distributes files across cores via Rayon: 100 PDFs in about 17ms.
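The multipliers quoted here follow directly from the mean times in the tables. As a small sketch of that arithmetic (the variable and function names are illustrative, and the numbers are copied from the tables rather than re-measured):

```typescript
// Mean times in milliseconds, copied from the benchmark tables.
const singleFileMs = {
  sigocr: 16.5,
  pdfJsExtract: 71.0,
  pdf2json: 280,
  pdfParse: 51.6,
  unpdf: 63.9,
};
const batchMs = {
  sigocrFiles: 17.2,
  sigocrLoop: 97.3,
  pdfJsExtractLoop: 134.0,
};

// How many times slower `otherMs` is than `baselineMs`, rounded to one decimal.
function slowdown(otherMs: number, baselineMs: number): number {
  return Math.round((otherMs / baselineMs) * 10) / 10;
}

const vsPdfJsExtract = slowdown(singleFileMs.pdfJsExtract, singleFileMs.sigocr); // 4.3
const vsPdf2json = slowdown(singleFileMs.pdf2json, singleFileMs.sigocr); // 17
const vsPdfParse = slowdown(singleFileMs.pdfParse, singleFileMs.sigocr); // 3.1
const loopVsBatch = slowdown(batchMs.sigocrLoop, batchMs.sigocrFiles); // 5.7
const pdfJsLoopVsBatch = slowdown(batchMs.pdfJsExtractLoop, batchMs.sigocrFiles); // 7.8
```

The 3.1x figure compares against pdf-parse, the faster of the two plain-text extractors; against unpdf the same calculation gives a larger ratio.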
Run benchmarks yourself:
pnpm bench
- null for scanned PDFs: No OCR engine. This library only handles embedded text.
- pdf_oxide for character-level extraction. No PDFium/MuPDF binaries to ship. ~1 MB package.
- hasText() fast path: Check if a PDF has embedded text without full extraction. Checks first 3 pages only.
- String.slice()