chandra-ocr 0.1.8 (PyPI)

OCR model that converts documents to markdown, HTML, or JSON.
Chandra

Chandra is a highly accurate OCR model that converts images and PDFs into structured HTML/Markdown/JSON while preserving layout information.

Features

  • Convert documents to Markdown, HTML, or JSON with detailed layout information
  • Good handwriting support
  • Reconstructs forms accurately, including checkboxes
  • Good support for tables, math, and complex layouts
  • Extracts images and diagrams, with captions and structured data
  • Support for 40+ languages
  • Two inference modes: local (HuggingFace) and remote (vLLM server)

Hosted API

  • We have a hosted API for Chandra here, which also includes other accuracy improvements and document workflows.
  • There is a free playground here if you want to try it out without installing.

Quickstart

The easiest way to start is with the CLI tools:

```shell
pip install chandra-ocr

# With vLLM
chandra_vllm
chandra input.pdf ./output

# With HuggingFace
chandra input.pdf ./output --method hf

# Interactive Streamlit app
chandra_app
```

Benchmarks

These are overall scores on the olmOCR bench; see the full benchmark table below.

Examples

| Type | Name | Link |
|---|---|---|
| Tables | Water Damage Form | View |
| Tables | 10K Filing | View |
| Forms | Handwritten Form | View |
| Forms | Lease Agreement | View |
| Handwriting | Doctor Note | View |
| Handwriting | Math Homework | View |
| Books | Geography Textbook | View |
| Books | Exercise Problems | View |
| Math | Attention Diagram | View |
| Math | Worksheet | View |
| Math | EGA Page | View |
| Newspapers | New York Times | View |
| Newspapers | LA Times | View |
| Other | Transcript | View |
| Other | Flowchart | View |

Community

Discord is where we discuss future development.

Installation

Package

```shell
pip install chandra-ocr
```

If you're going to use the HuggingFace method, we also recommend installing Flash Attention.

From Source

```shell
git clone https://github.com/datalab-to/chandra.git
cd chandra
uv sync
source .venv/bin/activate
```

Usage

CLI

Process single files or entire directories:

```shell
# Single file, with vLLM server (see below for how to launch vLLM)
chandra input.pdf ./output --method vllm

# Process all files in a directory with the local model
chandra ./documents ./output --method hf
```

CLI Options:

  • --method [hf|vllm]: Inference method (default: vllm)
  • --page-range TEXT: Page range for PDFs (e.g., "1-5,7,9-12")
  • --max-output-tokens INTEGER: Max tokens per page
  • --max-workers INTEGER: Parallel workers for vLLM
  • --include-images/--no-images: Extract and save images (default: include)
  • --include-headers-footers/--no-headers-footers: Include page headers/footers (default: exclude)
  • --batch-size INTEGER: Pages per batch (default: 1)
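To illustrate the `--page-range` syntax (e.g. `"1-5,7,9-12"`), here is a minimal sketch of a parser for that format. `parse_page_range` is a hypothetical helper for illustration only, not part of chandra-ocr:

```python
def parse_page_range(spec: str) -> list[int]:
    """Expand a page-range spec like "1-5,7,9-12" into a sorted list of pages."""
    pages: set[int] = set()
    for part in spec.split(","):
        part = part.strip()
        if "-" in part:
            # A span like "9-12" is inclusive on both ends.
            start, end = part.split("-", 1)
            pages.update(range(int(start), int(end) + 1))
        else:
            # A bare number selects a single page.
            pages.add(int(part))
    return sorted(pages)

print(parse_page_range("1-5,7,9-12"))
# [1, 2, 3, 4, 5, 7, 9, 10, 11, 12]
```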

Output Structure:

Each processed file creates a subdirectory with:

  • <filename>.md - Markdown output
  • <filename>.html - HTML output
  • <filename>_metadata.json - Metadata (page info, token count, etc.)
  • images/ - Extracted images from the document
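Given the layout above, a downstream script might gather results like this. This is a sketch assuming the directory names described; `collect_outputs` is not a helper shipped with chandra-ocr:

```python
from pathlib import Path

def collect_outputs(output_dir: str) -> dict[str, dict[str, Path]]:
    """Map each processed file's subdirectory to its expected output paths."""
    results: dict[str, dict[str, Path]] = {}
    for sub in Path(output_dir).iterdir():
        if not sub.is_dir():
            continue
        name = sub.name
        results[name] = {
            "markdown": sub / f"{name}.md",
            "html": sub / f"{name}.html",
            "metadata": sub / f"{name}_metadata.json",
            "images": sub / "images",
        }
    return results
```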

Streamlit Web App

Launch the interactive demo for single-page processing:

```shell
chandra_app
```

vLLM Server (Optional)

For production deployments or batch processing, use the vLLM server:

```shell
chandra_vllm
```

This launches a Docker container with optimized inference settings. Configure via environment variables:

  • VLLM_API_BASE: Server URL (default: http://localhost:8000/v1)
  • VLLM_MODEL_NAME: Model name for the server (default: chandra)
  • VLLM_GPUS: GPU device IDs (default: 0)

You can also start your own vLLM server with the datalab-to/chandra model.
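For example, the environment variables above can be set inline to run the server on different GPUs and point the CLI at it (the values here are illustrative, not recommendations):

```shell
# Launch the bundled vLLM container on GPUs 0 and 1 (example values)
VLLM_GPUS=0,1 chandra_vllm

# Point the CLI at the running server
VLLM_API_BASE=http://localhost:8000/v1 chandra input.pdf ./output --method vllm
```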

Configuration

Settings can be configured via environment variables or a local.env file:

```
# Model settings
MODEL_CHECKPOINT=datalab-to/chandra
MAX_OUTPUT_TOKENS=8192

# vLLM settings
VLLM_API_BASE=http://localhost:8000/v1
VLLM_MODEL_NAME=chandra
VLLM_GPUS=0
```
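Settings of this kind typically resolve as environment-variable overrides on top of defaults. A minimal sketch of that pattern follows; the variable names match the README above, but the `setting` loader itself is illustrative, not chandra-ocr's actual implementation:

```python
import os

# Defaults mirroring the README's documented settings.
DEFAULTS = {
    "MODEL_CHECKPOINT": "datalab-to/chandra",
    "MAX_OUTPUT_TOKENS": "8192",
    "VLLM_API_BASE": "http://localhost:8000/v1",
    "VLLM_MODEL_NAME": "chandra",
    "VLLM_GPUS": "0",
}

def setting(name: str) -> str:
    """Return the environment override for a setting, else its default."""
    return os.environ.get(name, DEFAULTS[name])
```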

Commercial usage

This code is licensed under Apache 2.0, and our model weights use a modified OpenRAIL-M license (free for research, personal use, and startups under $2M in funding or revenue; the weights cannot be used to compete with our API). To remove the OpenRAIL license requirements, or for broader commercial licensing, visit our pricing page here.

Benchmark table

| Model | ArXiv | Old Scans Math | Tables | Old Scans | Headers and Footers | Multi column | Long tiny text | Base | Overall | Source |
|---|---|---|---|---|---|---|---|---|---|---|
| Datalab Chandra v0.1.0 | 82.2 | 80.3 | 88.0 | 50.4 | 90.8 | 81.2 | 92.3 | 99.9 | 83.1 ± 0.9 | Own benchmarks |
| Datalab Marker v1.10.0 | 83.8 | 69.7 | 74.8 | 32.3 | 86.6 | 79.4 | 85.7 | 99.6 | 76.5 ± 1.0 | Own benchmarks |
| Mistral OCR API | 77.2 | 67.5 | 60.6 | 29.3 | 93.6 | 71.3 | 77.1 | 99.4 | 72.0 ± 1.1 | olmocr repo |
| Deepseek OCR | 75.2 | 72.3 | 79.7 | 33.3 | 96.1 | 66.7 | 80.1 | 99.7 | 75.4 ± 1.0 | Own benchmarks |
| GPT-4o (Anchored) | 53.5 | 74.5 | 70.0 | 40.7 | 93.8 | 69.3 | 60.6 | 96.8 | 69.9 ± 1.1 | olmocr repo |
| Gemini Flash 2 (Anchored) | 54.5 | 56.1 | 72.1 | 34.2 | 64.7 | 61.5 | 71.5 | 95.6 | 63.8 ± 1.2 | olmocr repo |
| Qwen 3 VL 8B | 70.2 | 75.1 | 45.6 | 37.5 | 89.1 | 62.1 | 43.0 | 94.3 | 64.6 ± 1.1 | Own benchmarks |
| olmOCR v0.3.0 | 78.6 | 79.9 | 72.9 | 43.9 | 95.1 | 77.3 | 81.2 | 98.9 | 78.5 ± 1.1 | olmocr repo |
| dots.ocr | 82.1 | 64.2 | 88.3 | 40.9 | 94.1 | 82.4 | 81.2 | 99.5 | 79.1 ± 1.0 | dots.ocr repo |

Credits

Thank you to the following open source projects:

Keywords

ocr
