
Local PDF & Markdown knowledge base with semantic search, AI enrichment, and SKOS taxonomy
Works with PDFs AND Markdown files - Index your research papers, books, notes, docs, and any .md files in one unified, searchable knowledge base.
┌─────────────┐ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐
│ PDF / MD │────▶│ Ollama │────▶│ Ollama │────▶│ libSQL │
│ (extract) │ │ (LLM) │ │ (embeddings)│ │ (vectors) │
└─────────────┘ └─────────────┘ └─────────────┘ └─────────────┘
│ │ │ │
pdf-parse llama3.2:3b mxbai-embed HNSW index
+ markdown enrichment 1024 dims cosine sim
.pdf and .md files are handled with the same workflow.

Note: pdf-brain is agent-first and emits a single JSON envelope to stdout by default. Use --format text for human-readable output (and TUI/progress rendering), or inspect the machine contract via pdf-brain capabilities.
# 1. Install (standalone binary, no runtime needed)
curl -fsSL https://raw.githubusercontent.com/joelhooks/pdf-brain/main/scripts/install.sh | bash
# 2. Install Ollama (macOS)
brew install ollama
# 3. Pull required models
ollama pull mxbai-embed-large # embeddings (required)
ollama pull llama3.2:3b # enrichment (optional but recommended)
# 4. Start Ollama
ollama serve
# 5. Initialize (creates DB + seeds starter taxonomy)
pdf-brain init
# 6. Add your first document
pdf-brain add ~/Documents/paper.pdf --enrich
Ollama is required for embeddings. The LLM model is optional but recommended for enrichment.
# macOS
brew install ollama
# Linux
curl -fsSL https://ollama.com/install.sh | sh
# Windows
# Download from https://ollama.com/download
# Required: Embedding model (1024 dimensions)
ollama pull mxbai-embed-large
# Recommended: Local LLM for enrichment
ollama pull llama3.2:3b
# Start Ollama server
ollama serve
# Standalone binary (no runtime needed)
curl -fsSL https://raw.githubusercontent.com/joelhooks/pdf-brain/main/scripts/install.sh | bash
# or via npm
npm install -g pdf-brain
pdf-brain is optimized for agentic workflows: stdout is machine-readable by default.
- --format json|ndjson|text (default: json)
- --pretty - pretty-print JSON
- --quiet (alias: --no-hints) - omit nextActions
- --log-level silent|error|info|debug (logs go to stderr)

Discover the full command/tool contract (including JSON Schemas) at runtime:
pdf-brain capabilities
# Check Ollama status
pdf-brain check
# Show library stats
pdf-brain stats
# Initialize library (creates DB, seeds taxonomy)
pdf-brain init
# Add a PDF
pdf-brain add /path/to/document.pdf
# Add a Markdown file
pdf-brain add /path/to/notes.md
# Add from URL (PDF or MD)
pdf-brain add https://example.com/paper.pdf
pdf-brain add https://raw.githubusercontent.com/user/repo/main/README.md
# Add with manual tags
pdf-brain add document.pdf --tags "ai,agents,research"
# Add with AI enrichment (extracts title, summary, concepts)
pdf-brain add document.pdf --enrich
pdf-brain add notes.md --enrich
# Semantic search (uses embeddings)
pdf-brain search "context engineering patterns"
# Full-text search only (faster, no embeddings)
pdf-brain search "context engineering" --fts
# Hybrid search (combines both)
pdf-brain search "machine learning" --hybrid
# Limit results
pdf-brain search "query" --limit 5
# Expand context around matches
pdf-brain search "query" --expand 500
# List all documents
pdf-brain list
# List by tag
pdf-brain list --tag ai
# Get document details
pdf-brain read "document-title"
# Remove a document
pdf-brain remove "document-title"
# Update tags
pdf-brain tag "document-title" "new,tags,here"
The taxonomy system uses SKOS (Simple Knowledge Organization System) for hierarchical concept organization.
# List all concepts
pdf-brain taxonomy list
# Show concept tree
pdf-brain taxonomy tree
# Show subtree from a concept
pdf-brain taxonomy tree programming
# Search concepts
pdf-brain taxonomy search "machine learning"
# Add a new concept
pdf-brain taxonomy add ai/transformers --label "Transformers" --broader ai-ml
# Assign concept to document
pdf-brain taxonomy assign "doc-id" "programming/typescript"
# Seed taxonomy from JSON file
pdf-brain taxonomy seed --file data/taxonomy.json
Recursively ingest directories containing PDFs and/or Markdown files:
# Ingest a directory with full LLM enrichment
pdf-brain ingest ~/Documents/papers --enrich
# Ingest your Obsidian vault or notes folder
pdf-brain ingest ~/Documents/obsidian --enrich
# Ingest multiple directories (PDFs, Markdown, mixed)
pdf-brain ingest ~/papers ~/books ~/notes --enrich
# With manual tags
pdf-brain ingest ~/books --tags "books,reference"
# Auto-tag only (faster, heuristics + light LLM)
pdf-brain ingest ~/docs --auto-tag
# Process only first N files (for testing)
pdf-brain ingest ~/papers --enrich --sample 10
# Disable TUI for simple output
pdf-brain ingest ~/papers --enrich --no-tui
Supported formats:
- .pdf - Research papers, books, documents
- .md - Notes, documentation, Obsidian vaults, READMEs

When you add documents with --enrich, the LLM extracts:
| Field | Description |
|---|---|
| title | Clean, properly formatted title |
| author | Author name(s) if detectable |
| summary | 2-3 sentence summary |
| documentType | book, paper, tutorial, guide, article, etc. |
| category | Primary category |
| tags | 5-10 descriptive tags |
| concepts | Matched concepts from your taxonomy |
| proposedConcepts | New concepts the LLM suggests adding |
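For illustration, enriched metadata with the fields from the table above might look like this (the values and exact envelope shape are hypothetical, not pdf-brain's literal output):

```json
{
  "title": "Attention Is All You Need",
  "author": "Vaswani et al.",
  "summary": "Introduces the Transformer, a sequence model built entirely on attention.",
  "documentType": "paper",
  "category": "ai-ml",
  "tags": ["transformers", "attention", "nlp", "deep-learning", "sequence-models"],
  "concepts": ["ai/transformers"],
  "proposedConcepts": ["ai/attention-mechanisms"]
}
```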
Enrichment supports multiple providers via the config system:
# Check current config
pdf-brain config show
# Use local Ollama (default)
pdf-brain config set enrichment.provider ollama
pdf-brain config set enrichment.model llama3.2:3b
# Use AI Gateway (Anthropic, OpenAI, etc.)
pdf-brain config set enrichment.provider gateway
pdf-brain config set enrichment.model anthropic/claude-haiku-4-5
export AI_GATEWAY_API_KEY=your-key
# Provider priority: config > CLI flag > auto-detect
pdf-brain add paper.pdf --enrich # uses config
pdf-brain add paper.pdf --enrich --provider ollama # override
If LLM enrichment fails (API error, rate limit, malformed response), pdf-brain automatically falls back to heuristic-based enrichment. The actual error is logged so you can debug provider issues.
The taxonomy is a hierarchical concept system for organizing documents; pdf-brain ships with a starter taxonomy out of the box.

When enriching documents, the LLM may propose new concepts. These are saved for review:
# See proposed concepts from enrichment
pdf-brain taxonomy proposed
# Accept a specific concept
pdf-brain taxonomy accept ai/rag --broader ai-ml
# Accept all proposed concepts
pdf-brain taxonomy accept --all
# Reject a concept
pdf-brain taxonomy reject ai/rag
# Clear all proposals
pdf-brain taxonomy clear-proposed
# Manually add a concept
pdf-brain taxonomy add ai/rag --label "RAG" --broader ai-ml
# Or edit data/taxonomy.json and re-seed
pdf-brain taxonomy seed --file data/taxonomy.json
Create your own taxonomy.json:
{
"concepts": [
{ "id": "cooking", "prefLabel": "Cooking" },
{ "id": "cooking/baking", "prefLabel": "Baking" },
{ "id": "cooking/grilling", "prefLabel": "Grilling" }
],
"hierarchy": [
{ "conceptId": "cooking/baking", "broaderId": "cooking" },
{ "conceptId": "cooking/grilling", "broaderId": "cooking" }
]
}
pdf-brain taxonomy seed --file my-taxonomy.json
pdf-brain stores configuration in $PDF_LIBRARY_PATH/config.json:
# Show all config
pdf-brain config show
# Get a specific value
pdf-brain config get enrichment.provider
# Set a value
pdf-brain config set enrichment.model anthropic/claude-haiku-4-5
{
"ollama": {
"host": "http://localhost:11434"
},
"embedding": {
"provider": "ollama",
"model": "mxbai-embed-large"
},
"enrichment": {
"provider": "gateway",
"model": "anthropic/claude-haiku-4-5"
},
"judge": {
"provider": "gateway",
"model": "anthropic/claude-haiku-4-5"
}
}
| Setting | Default | Description |
|---|---|---|
| ollama.host | http://localhost:11434 | Ollama API endpoint |
| embedding.provider | ollama | Embedding provider (ollama only) |
| embedding.model | mxbai-embed-large | Embedding model (1024 dims) |
| enrichment.provider | ollama | LLM provider: ollama or gateway |
| enrichment.model | llama3.2:3b | Model for document enrichment |
| judge.provider | ollama | Provider for concept deduplication |
| judge.model | llama3.2:3b | Model for judging duplicate concepts |
| Variable | Default | Description |
|---|---|---|
| PDF_LIBRARY_PATH | ~/Documents/.pdf-library | Library storage location |
| OLLAMA_HOST | http://localhost:11434 | Ollama API endpoint |
| AI_GATEWAY_API_KEY | - | API key for AI Gateway |
| PDF_BRAIN_LOG_LEVEL | silent | stderr logging verbosity |
| PDF_BRAIN_QUERY_EMBED_CACHE_SIZE | 256 | Query embedding LRU cache size (0 disables) |
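For example, to keep the library on an external drive and turn on verbose logging, set the variables from the table before running pdf-brain (the paths are illustrative):

```shell
# Store the library somewhere other than the default ~/Documents/.pdf-library
export PDF_LIBRARY_PATH=/Volumes/archive/.pdf-library

# Point at a non-default Ollama instance and enable debug logs on stderr
export OLLAMA_HOST=http://localhost:11434
export PDF_BRAIN_LOG_LEVEL=debug
```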
For cloud LLM providers (Anthropic, OpenAI, etc.), use the AI Gateway:
# Set your API key
export AI_GATEWAY_API_KEY=your-key
# Configure to use gateway
pdf-brain config set enrichment.provider gateway
pdf-brain config set enrichment.model anthropic/claude-haiku-4-5
# Other supported models:
# - anthropic/claude-sonnet-4-20250514
# - openai/gpt-4o-mini
# - openai/gpt-4o
~/Documents/.pdf-library/
├── library.db # libSQL database (vectors, FTS, metadata, taxonomy)
├── library.db-shm # Shared memory (WAL mode)
├── library.db-wal # Write-ahead log
└── downloads/ # PDFs downloaded from URLs
The database can get large due to vector index overhead. For ~500k chunks:
| Component | Size | Notes |
|---|---|---|
| Text content | ~180MB | Actual chunk text |
| Embeddings | ~1.9GB | 500k × 1024 dims × 4 bytes |
| Vector index | ~48GB | HNSW neighbor graphs (~100KB/row) |
| FTS index | ~200MB | Full-text search |
The *_idx_shadow tables store HNSW neighbor graphs for approximate nearest neighbor search. Each row averages ~100KB.
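The ~1.9GB embeddings figure in the table is plain arithmetic - 500k float32 vectors of 1024 dimensions:

```shell
# 500,000 chunks x 1024 dimensions x 4 bytes per float32 component
bytes=$((500000 * 1024 * 4))
echo "$bytes"   # 2048000000 bytes

# Convert to GiB with awk (shell arithmetic is integer-only)
awk -v b="$bytes" 'BEGIN { printf "%.2f GiB\n", b / 1024^3 }'   # 1.91 GiB
```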
libSQL quirk: SELECT COUNT(*) FROM embeddings returns 0. Always count a specific column:
SELECT COUNT(*) FROM embeddings        -- returns 0 (quirk)
SELECT COUNT(chunk_id) FROM embeddings -- correct
PDFs are parsed with pdf-parse; Markdown is parsed directly.

pdf-brain ships as an MCP server for AI coding assistants:
{
"mcpServers": {
"pdf-brain": {
"command": "npx",
"args": ["pdf-brain", "mcp"]
}
}
}
| Tool | Description |
|---|---|
| pdf-brain_add | Add PDF/Markdown to library (supports URLs) |
| pdf-brain_batch_add | Bulk ingest from directory |
| pdf-brain_search | Unified semantic search (docs + concepts) |
| pdf-brain_list | List documents, optionally filter by tag |
| pdf-brain_read | Get document details and metadata |
| pdf-brain_remove | Remove document from library |
| pdf-brain_tag | Set tags on a document |
| pdf-brain_stats | Library statistics (docs, chunks, embeddings) |
| Tool | Description |
|---|---|
| pdf-brain_taxonomy_list | List all concepts (optional tree format) |
| pdf-brain_taxonomy_tree | Visual concept tree with box-drawing |
| pdf-brain_taxonomy_add | Add new concept to taxonomy |
| pdf-brain_taxonomy_assign | Assign concept to document |
| pdf-brain_taxonomy_search | Search concepts by label |
| pdf-brain_taxonomy_seed | Load taxonomy from JSON file |
| Tool | Description |
|---|---|
| pdf-brain_config_show | Display all config |
| pdf-brain_config_get | Get specific config value |
| pdf-brain_config_set | Set config value |
| Tool | Description |
|---|---|
| pdf-brain_check | Check if Ollama is ready |
| pdf-brain_repair | Fix database integrity issues |
# Check if Ollama is running
curl http://localhost:11434/api/tags
# Start Ollama
ollama serve
# Check models
ollama list
# Pull required models
ollama pull mxbai-embed-large
ollama pull llama3.2:3b
The database uses WAL mode. If you see lock errors:
# Check for zombie processes
lsof ~/Documents/.pdf-library/library.db*
# Force checkpoint
sqlite3 ~/Documents/.pdf-library/library.db "PRAGMA wal_checkpoint(TRUNCATE);"
Enrichment is CPU-intensive. For large batches:
- Use --auto-tag instead of --enrich for faster processing

# Clone
git clone https://github.com/joelhooks/pdf-brain
cd pdf-brain
# Install
bun install
# Run CLI
bun run src/cli.ts <command>
# Run tests
bun test
# Type check
bun run typecheck
MIT