
pdf-brain (npm): latest version 2.0.0 · 17 published versions · 1 maintainer

pdf-brain

Local PDF & Markdown knowledge base with semantic search and AI-powered enrichment.

Works with PDFs and Markdown files: index your research papers, books, notes, docs, and any .md files in one unified, searchable knowledge base.

┌─────────────┐     ┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│  PDF / MD   │────▶│   Ollama    │────▶│   Ollama    │────▶│   libSQL    │
│  (extract)  │     │    (LLM)    │     │ (embeddings)│     │  (vectors)  │
└─────────────┘     └─────────────┘     └─────────────┘     └─────────────┘
      │                   │                   │                   │
   pdf-parse         llama3.2:3b        mxbai-embed          HNSW index
   + markdown        enrichment          1024 dims           cosine sim

Features

  • PDF + Markdown - Index .pdf and .md files with the same workflow
  • Local-first - Everything runs on your machine, no API costs
  • AI enrichment - LLM extracts titles, summaries, tags, and concepts
  • SKOS taxonomy - Organize documents with hierarchical concepts
  • Vector search - Semantic search via Ollama embeddings
  • Hybrid search - Combine vector similarity with full-text search
  • MCP server - Use with Claude, Cursor, and other AI assistants

Quick Start

Note: pdf-brain is agent-first and emits a single JSON envelope to stdout by default.
Use --format text for human-readable output (and TUI/progress rendering), or inspect the machine contract via pdf-brain capabilities.

# 1. Install (standalone binary, no runtime needed)
curl -fsSL https://raw.githubusercontent.com/joelhooks/pdf-brain/main/scripts/install.sh | bash

# 2. Install Ollama (macOS)
brew install ollama

# 3. Pull required models
ollama pull mxbai-embed-large   # embeddings (required)
ollama pull llama3.2:3b         # enrichment (optional but recommended)

# 4. Start Ollama
ollama serve

# 5. Initialize (creates DB + seeds starter taxonomy)
pdf-brain init

# 6. Add your first document
pdf-brain add ~/Documents/paper.pdf --enrich

Installation

Prerequisites

Ollama is required for embeddings. The LLM model is optional but recommended for enrichment.

# macOS
brew install ollama

# Linux
curl -fsSL https://ollama.com/install.sh | sh

# Windows
# Download from https://ollama.com/download

Models

# Required: Embedding model (1024 dimensions)
ollama pull mxbai-embed-large

# Recommended: Local LLM for enrichment
ollama pull llama3.2:3b

# Start Ollama server
ollama serve

Install pdf-brain

# Standalone binary (no runtime needed)
curl -fsSL https://raw.githubusercontent.com/joelhooks/pdf-brain/main/scripts/install.sh | bash

# or via npm
npm install -g pdf-brain

CLI Reference

Agent Output (Default)

pdf-brain is optimized for agentic workflows: stdout is machine-readable by default.

  • --format json|ndjson|text - output format (default: json)
  • --pretty - pretty-print JSON
  • --quiet (alias: --no-hints) - omit nextActions
  • --log-level silent|error|info|debug - logs go to stderr

Discover the full command/tool contract (including JSON Schemas) at runtime:

pdf-brain capabilities
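A script consuming that contract might look like the sketch below. The envelope fields used here (`ok`, `data`, `nextActions`) are assumptions for illustration, not the documented schema; inspect `pdf-brain capabilities` for the real contract.

```typescript
// Sketch: consuming pdf-brain's JSON stdout from a script.
// The field names below are illustrative assumptions, not the
// documented envelope; check `pdf-brain capabilities` for the schema.
type Envelope = {
  ok: boolean;
  data?: unknown;
  nextActions?: string[];
};

function parseEnvelope(stdout: string): Envelope {
  const parsed = JSON.parse(stdout) as Envelope;
  if (typeof parsed.ok !== "boolean") {
    throw new Error("unexpected envelope shape");
  }
  return parsed;
}

// Example with a hypothetical payload:
const sample =
  '{"ok":true,"data":{"documents":3},"nextActions":["pdf-brain stats"]}';
console.log(parseEnvelope(sample).ok); // true
```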

Basic Commands

# Check Ollama status
pdf-brain check

# Show library stats
pdf-brain stats

# Initialize library (creates DB, seeds taxonomy)
pdf-brain init

Adding Documents

# Add a PDF
pdf-brain add /path/to/document.pdf

# Add a Markdown file
pdf-brain add /path/to/notes.md

# Add from URL (PDF or MD)
pdf-brain add https://example.com/paper.pdf
pdf-brain add https://raw.githubusercontent.com/user/repo/main/README.md

# Add with manual tags
pdf-brain add document.pdf --tags "ai,agents,research"

# Add with AI enrichment (extracts title, summary, concepts)
pdf-brain add document.pdf --enrich
pdf-brain add notes.md --enrich

Searching

# Semantic search (uses embeddings)
pdf-brain search "context engineering patterns"

# Full-text search only (faster, no embeddings)
pdf-brain search "context engineering" --fts

# Hybrid search (combines both)
pdf-brain search "machine learning" --hybrid

# Limit results
pdf-brain search "query" --limit 5

# Expand context around matches
pdf-brain search "query" --expand 500
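How --hybrid fuses the two rankings is not documented here; reciprocal rank fusion (RRF) is one common technique, sketched below purely as an illustration of the idea rather than pdf-brain's actual scoring.

```typescript
// Reciprocal rank fusion: each document scores 1/(k + rank) per list,
// so items ranked well by both vector and FTS search rise to the top.
// This is a generic illustration, not pdf-brain's implementation.
function rrf(vectorRanking: string[], ftsRanking: string[], k = 60): string[] {
  const scores = new Map<string, number>();
  for (const ranking of [vectorRanking, ftsRanking]) {
    ranking.forEach((id, rank) => {
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + rank + 1));
    });
  }
  return [...scores.entries()]
    .sort((a, b) => b[1] - a[1])
    .map(([id]) => id);
}

// A doc ranked near the top of both lists wins overall:
console.log(rrf(["a", "b", "c"], ["b", "c", "a"])[0]); // b
```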

Managing Documents

# List all documents
pdf-brain list

# List by tag
pdf-brain list --tag ai

# Get document details
pdf-brain read "document-title"

# Remove a document
pdf-brain remove "document-title"

# Update tags
pdf-brain tag "document-title" "new,tags,here"

Taxonomy Commands

The taxonomy system uses SKOS (Simple Knowledge Organization System) for hierarchical concept organization.

# List all concepts
pdf-brain taxonomy list

# Show concept tree
pdf-brain taxonomy tree

# Show subtree from a concept
pdf-brain taxonomy tree programming

# Search concepts
pdf-brain taxonomy search "machine learning"

# Add a new concept
pdf-brain taxonomy add ai/transformers --label "Transformers" --broader ai-ml

# Assign concept to document
pdf-brain taxonomy assign "doc-id" "programming/typescript"

# Seed taxonomy from JSON file
pdf-brain taxonomy seed --file data/taxonomy.json

Bulk Ingest

Recursively ingest directories containing PDFs and/or Markdown files:

# Ingest a directory with full LLM enrichment
pdf-brain ingest ~/Documents/papers --enrich

# Ingest your Obsidian vault or notes folder
pdf-brain ingest ~/Documents/obsidian --enrich

# Ingest multiple directories (PDFs, Markdown, mixed)
pdf-brain ingest ~/papers ~/books ~/notes --enrich

# With manual tags
pdf-brain ingest ~/books --tags "books,reference"

# Auto-tag only (faster, heuristics + light LLM)
pdf-brain ingest ~/docs --auto-tag

# Process only first N files (for testing)
pdf-brain ingest ~/papers --enrich --sample 10

# Disable TUI for simple output
pdf-brain ingest ~/papers --enrich --no-tui

Supported formats:

  • .pdf - Research papers, books, documents
  • .md - Notes, documentation, Obsidian vaults, READMEs

Enrichment

When you add documents with --enrich, the LLM extracts:

| Field | Description |
| --- | --- |
| title | Clean, properly formatted title |
| author | Author name(s) if detectable |
| summary | 2-3 sentence summary |
| documentType | book, paper, tutorial, guide, article, etc. |
| category | Primary category |
| tags | 5-10 descriptive tags |
| concepts | Matched concepts from your taxonomy |
| proposedConcepts | New concepts the LLM suggests adding |

LLM Providers

Enrichment supports multiple providers via the config system:

# Check current config
pdf-brain config show

# Use local Ollama (default)
pdf-brain config set enrichment.provider ollama
pdf-brain config set enrichment.model llama3.2:3b

# Use AI Gateway (Anthropic, OpenAI, etc.)
pdf-brain config set enrichment.provider gateway
pdf-brain config set enrichment.model anthropic/claude-haiku-4-5
export AI_GATEWAY_API_KEY=your-key

# Provider priority: config > CLI flag > auto-detect
pdf-brain add paper.pdf --enrich              # uses config
pdf-brain add paper.pdf --enrich --provider ollama  # override

Enrichment Fallback

If LLM enrichment fails (API error, rate limit, malformed response), pdf-brain automatically falls back to heuristic-based enrichment:

  • Title: Cleaned from filename
  • Tags: Extracted from path, filename, and content keywords
  • Category: Inferred from directory structure

The actual error is logged so you can debug provider issues.
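The exact heuristics are internal to pdf-brain; the title step might look roughly like this sketch (the specific cleanup rules shown are assumptions for illustration).

```typescript
// Illustrative filename-to-title heuristic: strip the extension,
// turn separators into spaces, and title-case the words.
function titleFromFilename(path: string): string {
  const base = path.split("/").pop() ?? path;
  return base
    .replace(/\.(pdf|md)$/i, "")              // drop extension
    .replace(/[-_]+/g, " ")                   // separators -> spaces
    .replace(/\b\w/g, (c) => c.toUpperCase()) // title case
    .trim();
}

console.log(titleFromFilename("~/papers/attention-is-all-you-need.pdf"));
// Attention Is All You Need
```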

Taxonomy

The taxonomy is a hierarchical concept system for organizing documents. It ships with a starter taxonomy covering:

  • Programming - TypeScript, React, Next.js, Testing, Architecture, DevOps, AI/ML
  • Education - Instructional Design, Learning Science, Course Creation, Assessment
  • Business - Marketing, Copywriting, Bootstrapping, Product, Sales
  • Design - UX, Visual Design, Systems Thinking, Information Architecture
  • Meta - Productivity, Note-taking, Knowledge Management, Writing

Growing Your Taxonomy

When enriching documents, the LLM may propose new concepts. These are saved for review:

# See proposed concepts from enrichment
pdf-brain taxonomy proposed

# Accept a specific concept
pdf-brain taxonomy accept ai/rag --broader ai-ml

# Accept all proposed concepts
pdf-brain taxonomy accept --all

# Reject a concept
pdf-brain taxonomy reject ai/rag

# Clear all proposals
pdf-brain taxonomy clear-proposed

# Manually add a concept
pdf-brain taxonomy add ai/rag --label "RAG" --broader ai-ml

# Or edit data/taxonomy.json and re-seed
pdf-brain taxonomy seed --file data/taxonomy.json

Custom Taxonomy

Create your own taxonomy.json:

{
  "concepts": [
    { "id": "cooking", "prefLabel": "Cooking" },
    { "id": "cooking/baking", "prefLabel": "Baking" },
    { "id": "cooking/grilling", "prefLabel": "Grilling" }
  ],
  "hierarchy": [
    { "conceptId": "cooking/baking", "broaderId": "cooking" },
    { "conceptId": "cooking/grilling", "broaderId": "cooking" }
  ]
}

pdf-brain taxonomy seed --file my-taxonomy.json
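Before seeding, it can help to check that every broaderId in hierarchy refers to a declared concept. The validateTaxonomy helper below is a hypothetical standalone check, not part of pdf-brain:

```typescript
// Validate a custom taxonomy.json: every hierarchy edge must reference
// concept ids that were actually declared.
type Taxonomy = {
  concepts: { id: string; prefLabel: string }[];
  hierarchy: { conceptId: string; broaderId: string }[];
};

function validateTaxonomy(t: Taxonomy): string[] {
  const ids = new Set(t.concepts.map((c) => c.id));
  const errors: string[] = [];
  for (const { conceptId, broaderId } of t.hierarchy) {
    if (!ids.has(conceptId)) errors.push(`unknown conceptId: ${conceptId}`);
    if (!ids.has(broaderId)) errors.push(`unknown broaderId: ${broaderId}`);
  }
  return errors;
}

const taxonomy: Taxonomy = {
  concepts: [
    { id: "cooking", prefLabel: "Cooking" },
    { id: "cooking/baking", prefLabel: "Baking" },
  ],
  hierarchy: [{ conceptId: "cooking/baking", broaderId: "cooking" }],
};
console.log(validateTaxonomy(taxonomy).length); // 0
```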

Configuration

Config File

pdf-brain stores configuration in $PDF_LIBRARY_PATH/config.json:

# Show all config
pdf-brain config show

# Get a specific value
pdf-brain config get enrichment.provider

# Set a value
pdf-brain config set enrichment.model anthropic/claude-haiku-4-5
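Dotted keys like enrichment.provider simply address paths in the nested config.json object. A minimal sketch of that lookup (the getPath helper is illustrative, not pdf-brain's internals):

```typescript
// Resolve a dotted key like "enrichment.provider" against a nested object.
function getPath(obj: Record<string, any>, path: string): unknown {
  return path
    .split(".")
    .reduce((o, key) => (o == null ? undefined : o[key]), obj);
}

const config = {
  enrichment: { provider: "ollama", model: "llama3.2:3b" },
};
console.log(getPath(config, "enrichment.provider")); // ollama
```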

Config Options

{
  "ollama": {
    "host": "http://localhost:11434"
  },
  "embedding": {
    "provider": "ollama",
    "model": "mxbai-embed-large"
  },
  "enrichment": {
    "provider": "gateway",
    "model": "anthropic/claude-haiku-4-5"
  },
  "judge": {
    "provider": "gateway",
    "model": "anthropic/claude-haiku-4-5"
  }
}

| Setting | Default | Description |
| --- | --- | --- |
| ollama.host | http://localhost:11434 | Ollama API endpoint |
| embedding.provider | ollama | Embedding provider (ollama only) |
| embedding.model | mxbai-embed-large | Embedding model (1024 dims) |
| enrichment.provider | ollama | LLM provider: ollama or gateway |
| enrichment.model | llama3.2:3b | Model for document enrichment |
| judge.provider | ollama | Provider for concept deduplication |
| judge.model | llama3.2:3b | Model for judging duplicate concepts |

Environment Variables

| Variable | Default | Description |
| --- | --- | --- |
| PDF_LIBRARY_PATH | ~/Documents/.pdf-library | Library storage location |
| OLLAMA_HOST | http://localhost:11434 | Ollama API endpoint |
| AI_GATEWAY_API_KEY | - | API key for AI Gateway |
| PDF_BRAIN_LOG_LEVEL | silent | stderr logging verbosity |
| PDF_BRAIN_QUERY_EMBED_CACHE_SIZE | 256 | Query embedding LRU cache size (0 disables) |
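PDF_BRAIN_QUERY_EMBED_CACHE_SIZE bounds an LRU cache of query embeddings, so repeated searches skip a round trip to Ollama. A minimal sketch of the idea (details are assumptions, not pdf-brain's actual implementation):

```typescript
// Minimal LRU cache backed by a Map, which iterates in insertion order.
class LruCache<V> {
  private map = new Map<string, V>();
  constructor(private capacity: number) {}

  get(key: string): V | undefined {
    const value = this.map.get(key);
    if (value === undefined) return undefined;
    this.map.delete(key); // refresh recency
    this.map.set(key, value);
    return value;
  }

  set(key: string, value: V): void {
    if (this.capacity <= 0) return; // size 0 disables caching
    this.map.delete(key);
    this.map.set(key, value);
    if (this.map.size > this.capacity) {
      // First key in iteration order is the least recently used.
      this.map.delete(this.map.keys().next().value!);
    }
  }
}

const cache = new LruCache<number[]>(2);
cache.set("q1", [0.1]);
cache.set("q2", [0.2]);
cache.get("q1");        // touch q1
cache.set("q3", [0.3]); // evicts q2, the least recently used
console.log(cache.get("q2")); // undefined
```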

AI Gateway

For cloud LLM providers (Anthropic, OpenAI, etc.), use the AI Gateway:

# Set your API key
export AI_GATEWAY_API_KEY=your-key

# Configure to use gateway
pdf-brain config set enrichment.provider gateway
pdf-brain config set enrichment.model anthropic/claude-haiku-4-5

# Other supported models:
# - anthropic/claude-sonnet-4-20250514
# - openai/gpt-4o-mini
# - openai/gpt-4o

Storage

~/Documents/.pdf-library/
├── library.db          # libSQL database (vectors, FTS, metadata, taxonomy)
├── library.db-shm      # Shared memory (WAL mode)
├── library.db-wal      # Write-ahead log
└── downloads/          # PDFs downloaded from URLs

Database Size

The database can get large due to vector index overhead. For ~500k chunks:

| Component | Size | Notes |
| --- | --- | --- |
| Text content | ~180MB | Actual chunk text |
| Embeddings | ~1.9GB | 500k × 1024 dims × 4 bytes |
| Vector index | ~48GB | HNSW neighbor graphs (~100KB/row) |
| FTS index | ~200MB | Full-text search |

The *_idx_shadow tables store HNSW neighbor graphs for approximate nearest neighbor search. Each row averages ~100KB.
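The ~1.9GB embeddings figure checks out from the raw numbers:

```typescript
// 500k chunks x 1024-dim float32 vectors, converted to GiB.
const chunks = 500_000;
const dims = 1024;
const bytesPerFloat = 4;

const bytes = chunks * dims * bytesPerFloat; // 2,048,000,000 bytes
const gib = bytes / 1024 ** 3;
console.log(gib.toFixed(2)); // 1.91, matching the ~1.9GB row above
```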

libSQL quirk: SELECT COUNT(*) FROM embeddings returns 0. Always count a specific column:

SELECT COUNT(chunk_id) FROM embeddings  -- correct

How It Works

  1. Extract - PDF text via pdf-parse, Markdown parsed directly
  2. Enrich (optional) - LLM extracts metadata, matches taxonomy concepts
  3. Chunk - Text split into ~512 token chunks with overlap
  4. Embed - Each chunk embedded via Ollama (1024 dimensions)
  5. Store - libSQL with vector index (HNSW) + FTS5
  6. Search - Query embedded, compared via cosine similarity
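The chunking and similarity steps above can be sketched in a few lines. Word-based chunking stands in for the real ~512-token chunking, and both helper names are illustrative:

```typescript
// Split a word list into fixed-size windows with overlap.
function chunk(words: string[], size: number, overlap: number): string[][] {
  const chunks: string[][] = [];
  for (let start = 0; start < words.length; start += size - overlap) {
    chunks.push(words.slice(start, start + size));
    if (start + size >= words.length) break;
  }
  return chunks;
}

// Cosine similarity between two embedding vectors.
function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

console.log(chunk(["a", "b", "c", "d", "e"], 3, 1).length); // 2
console.log(cosine([1, 0], [1, 0])); // 1
```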

MCP Integration

pdf-brain ships as an MCP server for AI coding assistants:

{
  "mcpServers": {
    "pdf-brain": {
      "command": "npx",
      "args": ["pdf-brain", "mcp"]
    }
  }
}

Document Tools

| Tool | Description |
| --- | --- |
| pdf-brain_add | Add PDF/Markdown to library (supports URLs) |
| pdf-brain_batch_add | Bulk ingest from directory |
| pdf-brain_search | Unified semantic search (docs + concepts) |
| pdf-brain_list | List documents, optionally filter by tag |
| pdf-brain_read | Get document details and metadata |
| pdf-brain_remove | Remove document from library |
| pdf-brain_tag | Set tags on a document |
| pdf-brain_stats | Library statistics (docs, chunks, embeddings) |

Taxonomy Tools

| Tool | Description |
| --- | --- |
| pdf-brain_taxonomy_list | List all concepts (optional tree format) |
| pdf-brain_taxonomy_tree | Visual concept tree with box-drawing |
| pdf-brain_taxonomy_add | Add new concept to taxonomy |
| pdf-brain_taxonomy_assign | Assign concept to document |
| pdf-brain_taxonomy_search | Search concepts by label |
| pdf-brain_taxonomy_seed | Load taxonomy from JSON file |

Config Tools

| Tool | Description |
| --- | --- |
| pdf-brain_config_show | Display all config |
| pdf-brain_config_get | Get specific config value |
| pdf-brain_config_set | Set config value |

Utility Tools

| Tool | Description |
| --- | --- |
| pdf-brain_check | Check if Ollama is ready |
| pdf-brain_repair | Fix database integrity issues |

Troubleshooting

"Ollama not available"

# Check if Ollama is running
curl http://localhost:11434/api/tags

# Start Ollama
ollama serve

# Check models
ollama list

"Model not found"

# Pull required models
ollama pull mxbai-embed-large
ollama pull llama3.2:3b

"Database locked"

The database uses WAL mode. If you see lock errors:

# Check for zombie processes
lsof ~/Documents/.pdf-library/library.db*

# Force checkpoint
sqlite3 ~/Documents/.pdf-library/library.db "PRAGMA wal_checkpoint(TRUNCATE);"

Slow enrichment

Enrichment is CPU-intensive. For large batches:

  • Use --auto-tag instead of --enrich for faster processing
  • Run overnight for large libraries
  • Consider GPU acceleration for Ollama

Development

# Clone
git clone https://github.com/joelhooks/pdf-brain
cd pdf-brain

# Install
bun install

# Run CLI
bun run src/cli.ts <command>

# Run tests
bun test

# Type check
bun run typecheck

License

MIT

Keywords

pdf

Package last updated on 07 Feb 2026