
pdf-brain (npm): latest version 2.0.0 · 17 published versions · 1 maintainer

pdf-brain

Local PDF & Markdown knowledge base with semantic search and AI-powered enrichment.

Works with PDFs and Markdown files: index your research papers, books, notes, docs, and any .md files in one unified, searchable knowledge base.

┌─────────────┐     ┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│  PDF / MD   │────▶│   Ollama    │────▶│   Ollama    │────▶│   libSQL    │
│  (extract)  │     │    (LLM)    │     │ (embeddings)│     │  (vectors)  │
└─────────────┘     └─────────────┘     └─────────────┘     └─────────────┘
      │                   │                   │                   │
   pdf-parse         llama3.2:3b        mxbai-embed          HNSW index
   + markdown        enrichment          1024 dims           cosine sim

Features

  • PDF + Markdown - Index .pdf and .md files with the same workflow
  • Local-first - Everything runs on your machine, no API costs
  • AI enrichment - LLM extracts titles, summaries, tags, and concepts
  • SKOS taxonomy - Organize documents with hierarchical concepts
  • Vector search - Semantic search via Ollama embeddings
  • Hybrid search - Combine vector similarity with full-text search
  • MCP server - Use with Claude, Cursor, and other AI assistants

Quick Start

Note: pdf-brain is agent-first and emits a single JSON envelope to stdout by default.
Use --format text for human-readable output (and TUI/progress rendering), or inspect the machine contract via pdf-brain capabilities.

# 1. Install (standalone binary, no runtime needed)
curl -fsSL https://raw.githubusercontent.com/joelhooks/pdf-brain/main/scripts/install.sh | bash

# 2. Install Ollama (macOS)
brew install ollama

# 3. Pull required models
ollama pull mxbai-embed-large   # embeddings (required)
ollama pull llama3.2:3b         # enrichment (optional but recommended)

# 4. Start Ollama
ollama serve

# 5. Initialize (creates DB + seeds starter taxonomy)
pdf-brain init

# 6. Add your first document
pdf-brain add ~/Documents/paper.pdf --enrich

Installation

Prerequisites

Ollama is required for embeddings. The LLM model is optional but recommended for enrichment.

# macOS
brew install ollama

# Linux
curl -fsSL https://ollama.com/install.sh | sh

# Windows
# Download from https://ollama.com/download

Models

# Required: Embedding model (1024 dimensions)
ollama pull mxbai-embed-large

# Recommended: Local LLM for enrichment
ollama pull llama3.2:3b

# Start Ollama server
ollama serve

Install pdf-brain

# Standalone binary (no runtime needed)
curl -fsSL https://raw.githubusercontent.com/joelhooks/pdf-brain/main/scripts/install.sh | bash

# or via npm
npm install -g pdf-brain

CLI Reference

Agent Output (Default)

pdf-brain is optimized for agentic workflows: stdout is machine-readable by default.

  • --format json|ndjson|text - output format (default: json)
  • --pretty - pretty-print JSON
  • --quiet (alias: --no-hints) - omit nextActions
  • --log-level silent|error|info|debug - logs go to stderr

Discover the full command/tool contract (including JSON Schemas) at runtime:

pdf-brain capabilities
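A script consuming that contract might look like the sketch below. The envelope fields used here (`ok`, `data`, `nextActions`) are assumptions for illustration, not the documented schema; inspect `pdf-brain capabilities` for the real contract.

```typescript
// Sketch: consuming pdf-brain's JSON stdout from a script.
// The field names below are illustrative assumptions, not the
// documented envelope; check `pdf-brain capabilities` for the schema.
type Envelope = {
  ok: boolean;
  data?: unknown;
  nextActions?: string[];
};

function parseEnvelope(stdout: string): Envelope {
  const parsed = JSON.parse(stdout) as Envelope;
  if (typeof parsed.ok !== "boolean") {
    throw new Error("unexpected envelope shape");
  }
  return parsed;
}

// Example with a hypothetical payload:
const sample =
  '{"ok":true,"data":{"documents":3},"nextActions":["pdf-brain stats"]}';
console.log(parseEnvelope(sample).ok); // true
```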

Basic Commands

# Check Ollama status
pdf-brain check

# Show library stats
pdf-brain stats

# Initialize library (creates DB, seeds taxonomy)
pdf-brain init

Adding Documents

# Add a PDF
pdf-brain add /path/to/document.pdf

# Add a Markdown file
pdf-brain add /path/to/notes.md

# Add from URL (PDF or MD)
pdf-brain add https://example.com/paper.pdf
pdf-brain add https://raw.githubusercontent.com/user/repo/main/README.md

# Add with manual tags
pdf-brain add document.pdf --tags "ai,agents,research"

# Add with AI enrichment (extracts title, summary, concepts)
pdf-brain add document.pdf --enrich
pdf-brain add notes.md --enrich

Searching

# Semantic search (uses embeddings)
pdf-brain search "context engineering patterns"

# Full-text search only (faster, no embeddings)
pdf-brain search "context engineering" --fts

# Hybrid search (combines both)
pdf-brain search "machine learning" --hybrid

# Limit results
pdf-brain search "query" --limit 5

# Expand context around matches
pdf-brain search "query" --expand 500
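How --hybrid fuses the two rankings is not documented here; reciprocal rank fusion (RRF) is one common technique, sketched below purely as an illustration of the idea rather than pdf-brain's actual scoring.

```typescript
// Reciprocal rank fusion: each document scores 1/(k + rank) per list,
// so items ranked well by both vector and FTS search rise to the top.
// This is a generic illustration, not pdf-brain's implementation.
function rrf(vectorRanking: string[], ftsRanking: string[], k = 60): string[] {
  const scores = new Map<string, number>();
  for (const ranking of [vectorRanking, ftsRanking]) {
    ranking.forEach((id, rank) => {
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + rank + 1));
    });
  }
  return [...scores.entries()]
    .sort((a, b) => b[1] - a[1])
    .map(([id]) => id);
}

// A doc ranked near the top of both lists wins overall:
console.log(rrf(["a", "b", "c"], ["b", "c", "a"])[0]); // b
```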

Managing Documents

# List all documents
pdf-brain list

# List by tag
pdf-brain list --tag ai

# Get document details
pdf-brain read "document-title"

# Remove a document
pdf-brain remove "document-title"

# Update tags
pdf-brain tag "document-title" "new,tags,here"

Taxonomy Commands

The taxonomy system uses SKOS (Simple Knowledge Organization System) for hierarchical concept organization.

# List all concepts
pdf-brain taxonomy list

# Show concept tree
pdf-brain taxonomy tree

# Show subtree from a concept
pdf-brain taxonomy tree programming

# Search concepts
pdf-brain taxonomy search "machine learning"

# Add a new concept
pdf-brain taxonomy add ai/transformers --label "Transformers" --broader ai-ml

# Assign concept to document
pdf-brain taxonomy assign "doc-id" "programming/typescript"

# Seed taxonomy from JSON file
pdf-brain taxonomy seed --file data/taxonomy.json

Bulk Ingest

Recursively ingest directories containing PDFs and/or Markdown files:

# Ingest a directory with full LLM enrichment
pdf-brain ingest ~/Documents/papers --enrich

# Ingest your Obsidian vault or notes folder
pdf-brain ingest ~/Documents/obsidian --enrich

# Ingest multiple directories (PDFs, Markdown, mixed)
pdf-brain ingest ~/papers ~/books ~/notes --enrich

# With manual tags
pdf-brain ingest ~/books --tags "books,reference"

# Auto-tag only (faster, heuristics + light LLM)
pdf-brain ingest ~/docs --auto-tag

# Process only first N files (for testing)
pdf-brain ingest ~/papers --enrich --sample 10

# Disable TUI for simple output
pdf-brain ingest ~/papers --enrich --no-tui

Supported formats:

  • .pdf - Research papers, books, documents
  • .md - Notes, documentation, Obsidian vaults, READMEs

Enrichment

When you add documents with --enrich, the LLM extracts:

| Field | Description |
| --- | --- |
| title | Clean, properly formatted title |
| author | Author name(s) if detectable |
| summary | 2-3 sentence summary |
| documentType | book, paper, tutorial, guide, article, etc. |
| category | Primary category |
| tags | 5-10 descriptive tags |
| concepts | Matched concepts from your taxonomy |
| proposedConcepts | New concepts the LLM suggests adding |

LLM Providers

Enrichment supports multiple providers via the config system:

# Check current config
pdf-brain config show

# Use local Ollama (default)
pdf-brain config set enrichment.provider ollama
pdf-brain config set enrichment.model llama3.2:3b

# Use AI Gateway (Anthropic, OpenAI, etc.)
pdf-brain config set enrichment.provider gateway
pdf-brain config set enrichment.model anthropic/claude-haiku-4-5
export AI_GATEWAY_API_KEY=your-key

# Provider priority: config > CLI flag > auto-detect
pdf-brain add paper.pdf --enrich              # uses config
pdf-brain add paper.pdf --enrich --provider ollama  # override

Enrichment Fallback

If LLM enrichment fails (API error, rate limit, malformed response), pdf-brain automatically falls back to heuristic-based enrichment:

  • Title: Cleaned from filename
  • Tags: Extracted from path, filename, and content keywords
  • Category: Inferred from directory structure

The actual error is logged so you can debug provider issues.
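The exact heuristics are internal to pdf-brain; the title step might look roughly like this sketch (the specific cleanup rules shown are assumptions for illustration).

```typescript
// Illustrative filename-to-title heuristic: strip the extension,
// turn separators into spaces, and title-case the words.
function titleFromFilename(path: string): string {
  const base = path.split("/").pop() ?? path;
  return base
    .replace(/\.(pdf|md)$/i, "")              // drop extension
    .replace(/[-_]+/g, " ")                   // separators -> spaces
    .replace(/\b\w/g, (c) => c.toUpperCase()) // title case
    .trim();
}

console.log(titleFromFilename("~/papers/attention-is-all-you-need.pdf"));
// Attention Is All You Need
```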

Taxonomy

The taxonomy is a hierarchical concept system for organizing documents. It ships with a starter taxonomy covering:

  • Programming - TypeScript, React, Next.js, Testing, Architecture, DevOps, AI/ML
  • Education - Instructional Design, Learning Science, Course Creation, Assessment
  • Business - Marketing, Copywriting, Bootstrapping, Product, Sales
  • Design - UX, Visual Design, Systems Thinking, Information Architecture
  • Meta - Productivity, Note-taking, Knowledge Management, Writing

Growing Your Taxonomy

When enriching documents, the LLM may propose new concepts. These are saved for review:

# See proposed concepts from enrichment
pdf-brain taxonomy proposed

# Accept a specific concept
pdf-brain taxonomy accept ai/rag --broader ai-ml

# Accept all proposed concepts
pdf-brain taxonomy accept --all

# Reject a concept
pdf-brain taxonomy reject ai/rag

# Clear all proposals
pdf-brain taxonomy clear-proposed

# Manually add a concept
pdf-brain taxonomy add ai/rag --label "RAG" --broader ai-ml

# Or edit data/taxonomy.json and re-seed
pdf-brain taxonomy seed --file data/taxonomy.json

Custom Taxonomy

Create your own taxonomy.json:

{
  "concepts": [
    { "id": "cooking", "prefLabel": "Cooking" },
    { "id": "cooking/baking", "prefLabel": "Baking" },
    { "id": "cooking/grilling", "prefLabel": "Grilling" }
  ],
  "hierarchy": [
    { "conceptId": "cooking/baking", "broaderId": "cooking" },
    { "conceptId": "cooking/grilling", "broaderId": "cooking" }
  ]
}

pdf-brain taxonomy seed --file my-taxonomy.json
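Before seeding, it can help to check that every broaderId in hierarchy refers to a declared concept. The validateTaxonomy helper below is a hypothetical standalone check, not part of pdf-brain:

```typescript
// Validate a custom taxonomy.json: every hierarchy edge must reference
// concept ids that were actually declared.
type Taxonomy = {
  concepts: { id: string; prefLabel: string }[];
  hierarchy: { conceptId: string; broaderId: string }[];
};

function validateTaxonomy(t: Taxonomy): string[] {
  const ids = new Set(t.concepts.map((c) => c.id));
  const errors: string[] = [];
  for (const { conceptId, broaderId } of t.hierarchy) {
    if (!ids.has(conceptId)) errors.push(`unknown conceptId: ${conceptId}`);
    if (!ids.has(broaderId)) errors.push(`unknown broaderId: ${broaderId}`);
  }
  return errors;
}

const taxonomy: Taxonomy = {
  concepts: [
    { id: "cooking", prefLabel: "Cooking" },
    { id: "cooking/baking", prefLabel: "Baking" },
  ],
  hierarchy: [{ conceptId: "cooking/baking", broaderId: "cooking" }],
};
console.log(validateTaxonomy(taxonomy).length); // 0
```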

Configuration

Config File

pdf-brain stores configuration in $PDF_LIBRARY_PATH/config.json:

# Show all config
pdf-brain config show

# Get a specific value
pdf-brain config get enrichment.provider

# Set a value
pdf-brain config set enrichment.model anthropic/claude-haiku-4-5
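Dotted keys like enrichment.provider simply address paths in the nested config.json object. A minimal sketch of that lookup (the getPath helper is illustrative, not pdf-brain's internals):

```typescript
// Resolve a dotted key like "enrichment.provider" against a nested object.
function getPath(obj: Record<string, any>, path: string): unknown {
  return path
    .split(".")
    .reduce((o, key) => (o == null ? undefined : o[key]), obj);
}

const config = {
  enrichment: { provider: "ollama", model: "llama3.2:3b" },
};
console.log(getPath(config, "enrichment.provider")); // ollama
```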

Config Options

{
  "ollama": {
    "host": "http://localhost:11434"
  },
  "embedding": {
    "provider": "ollama",
    "model": "mxbai-embed-large"
  },
  "enrichment": {
    "provider": "gateway",
    "model": "anthropic/claude-haiku-4-5"
  },
  "judge": {
    "provider": "gateway",
    "model": "anthropic/claude-haiku-4-5"
  }
}

| Setting | Default | Description |
| --- | --- | --- |
| ollama.host | http://localhost:11434 | Ollama API endpoint |
| embedding.provider | ollama | Embedding provider (ollama only) |
| embedding.model | mxbai-embed-large | Embedding model (1024 dims) |
| enrichment.provider | ollama | LLM provider: ollama or gateway |
| enrichment.model | llama3.2:3b | Model for document enrichment |
| judge.provider | ollama | Provider for concept deduplication |
| judge.model | llama3.2:3b | Model for judging duplicate concepts |

Environment Variables

| Variable | Default | Description |
| --- | --- | --- |
| PDF_LIBRARY_PATH | ~/Documents/.pdf-library | Library storage location |
| OLLAMA_HOST | http://localhost:11434 | Ollama API endpoint |
| AI_GATEWAY_API_KEY | - | API key for AI Gateway |
| PDF_BRAIN_LOG_LEVEL | silent | stderr logging verbosity |
| PDF_BRAIN_QUERY_EMBED_CACHE_SIZE | 256 | Query embedding LRU cache size (0 disables) |
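PDF_BRAIN_QUERY_EMBED_CACHE_SIZE bounds an LRU cache of query embeddings, so repeated searches skip a round trip to Ollama. A minimal sketch of the idea (details are assumptions, not pdf-brain's actual implementation):

```typescript
// Minimal LRU cache backed by a Map, which iterates in insertion order.
class LruCache<V> {
  private map = new Map<string, V>();
  constructor(private capacity: number) {}

  get(key: string): V | undefined {
    const value = this.map.get(key);
    if (value === undefined) return undefined;
    this.map.delete(key); // refresh recency
    this.map.set(key, value);
    return value;
  }

  set(key: string, value: V): void {
    if (this.capacity <= 0) return; // size 0 disables caching
    this.map.delete(key);
    this.map.set(key, value);
    if (this.map.size > this.capacity) {
      // First key in iteration order is the least recently used.
      this.map.delete(this.map.keys().next().value!);
    }
  }
}

const cache = new LruCache<number[]>(2);
cache.set("q1", [0.1]);
cache.set("q2", [0.2]);
cache.get("q1");        // touch q1
cache.set("q3", [0.3]); // evicts q2, the least recently used
console.log(cache.get("q2")); // undefined
```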

AI Gateway

For cloud LLM providers (Anthropic, OpenAI, etc.), use the AI Gateway:

# Set your API key
export AI_GATEWAY_API_KEY=your-key

# Configure to use gateway
pdf-brain config set enrichment.provider gateway
pdf-brain config set enrichment.model anthropic/claude-haiku-4-5

# Other supported models:
# - anthropic/claude-sonnet-4-20250514
# - openai/gpt-4o-mini
# - openai/gpt-4o

Storage

~/Documents/.pdf-library/
├── library.db          # libSQL database (vectors, FTS, metadata, taxonomy)
├── library.db-shm      # Shared memory (WAL mode)
├── library.db-wal      # Write-ahead log
└── downloads/          # PDFs downloaded from URLs

Database Size

The database can get large due to vector index overhead. For ~500k chunks:

| Component | Size | Notes |
| --- | --- | --- |
| Text content | ~180MB | Actual chunk text |
| Embeddings | ~1.9GB | 500k × 1024 dims × 4 bytes |
| Vector index | ~48GB | HNSW neighbor graphs (~100KB/row) |
| FTS index | ~200MB | Full-text search |

The *_idx_shadow tables store HNSW neighbor graphs for approximate nearest neighbor search. Each row averages ~100KB.
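The ~1.9GB embeddings figure checks out from the raw numbers:

```typescript
// 500k chunks x 1024-dim float32 vectors, converted to GiB.
const chunks = 500_000;
const dims = 1024;
const bytesPerFloat = 4;

const bytes = chunks * dims * bytesPerFloat; // 2,048,000,000 bytes
const gib = bytes / 1024 ** 3;
console.log(gib.toFixed(2)); // 1.91, matching the ~1.9GB row above
```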

libSQL quirk: SELECT COUNT(*) FROM embeddings returns 0. Always count a specific column:

SELECT COUNT(chunk_id) FROM embeddings  -- correct

How It Works

  1. Extract - PDF text via pdf-parse, Markdown parsed directly
  2. Enrich (optional) - LLM extracts metadata, matches taxonomy concepts
  3. Chunk - Text split into ~512 token chunks with overlap
  4. Embed - Each chunk embedded via Ollama (1024 dimensions)
  5. Store - libSQL with vector index (HNSW) + FTS5
  6. Search - Query embedded, compared via cosine similarity
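The chunking and similarity steps above can be sketched in a few lines. Word-based chunking stands in for the real ~512-token chunking, and both helper names are illustrative:

```typescript
// Split a word list into fixed-size windows with overlap.
function chunk(words: string[], size: number, overlap: number): string[][] {
  const chunks: string[][] = [];
  for (let start = 0; start < words.length; start += size - overlap) {
    chunks.push(words.slice(start, start + size));
    if (start + size >= words.length) break;
  }
  return chunks;
}

// Cosine similarity between two embedding vectors.
function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

console.log(chunk(["a", "b", "c", "d", "e"], 3, 1).length); // 2
console.log(cosine([1, 0], [1, 0])); // 1
```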

MCP Integration

pdf-brain ships as an MCP server for AI coding assistants:

{
  "mcpServers": {
    "pdf-brain": {
      "command": "npx",
      "args": ["pdf-brain", "mcp"]
    }
  }
}

Document Tools

| Tool | Description |
| --- | --- |
| pdf-brain_add | Add PDF/Markdown to library (supports URLs) |
| pdf-brain_batch_add | Bulk ingest from directory |
| pdf-brain_search | Unified semantic search (docs + concepts) |
| pdf-brain_list | List documents, optionally filter by tag |
| pdf-brain_read | Get document details and metadata |
| pdf-brain_remove | Remove document from library |
| pdf-brain_tag | Set tags on a document |
| pdf-brain_stats | Library statistics (docs, chunks, embeddings) |

Taxonomy Tools

| Tool | Description |
| --- | --- |
| pdf-brain_taxonomy_list | List all concepts (optional tree format) |
| pdf-brain_taxonomy_tree | Visual concept tree with box-drawing |
| pdf-brain_taxonomy_add | Add new concept to taxonomy |
| pdf-brain_taxonomy_assign | Assign concept to document |
| pdf-brain_taxonomy_search | Search concepts by label |
| pdf-brain_taxonomy_seed | Load taxonomy from JSON file |

Config Tools

| Tool | Description |
| --- | --- |
| pdf-brain_config_show | Display all config |
| pdf-brain_config_get | Get specific config value |
| pdf-brain_config_set | Set config value |

Utility Tools

| Tool | Description |
| --- | --- |
| pdf-brain_check | Check if Ollama is ready |
| pdf-brain_repair | Fix database integrity issues |

Troubleshooting

"Ollama not available"

# Check if Ollama is running
curl http://localhost:11434/api/tags

# Start Ollama
ollama serve

# Check models
ollama list

"Model not found"

# Pull required models
ollama pull mxbai-embed-large
ollama pull llama3.2:3b

"Database locked"

The database uses WAL mode. If you see lock errors:

# Check for zombie processes
lsof ~/Documents/.pdf-library/library.db*

# Force checkpoint
sqlite3 ~/Documents/.pdf-library/library.db "PRAGMA wal_checkpoint(TRUNCATE);"

Slow enrichment

Enrichment is CPU-intensive. For large batches:

  • Use --auto-tag instead of --enrich for faster processing
  • Run overnight for large libraries
  • Consider GPU acceleration for Ollama

Development

# Clone
git clone https://github.com/joelhooks/pdf-brain
cd pdf-brain

# Install
bun install

# Run CLI
bun run src/cli.ts <command>

# Run tests
bun test

# Type check
bun run typecheck

License

MIT

Keywords

pdf

Package last updated on 07 Feb 2026