typhoon-ocr

A package for extracting structured content from PDFs and images using Typhoon OCR models

0.3.8

PyPI

Maintainers: 1

Typhoon OCR

Typhoon OCR is a model for extracting structured markdown from images or PDFs. It supports document layout analysis and table extraction, returning results in markdown or HTML. This package provides utilities to convert images and PDFs to the format supported by the Typhoon OCR model.

Languages Supported

The Typhoon OCR model supports:

English
Thai

Features

Convert images to PDFs for unified processing
Extract text and layout information from PDFs and images
Generate OCR-ready messages for API processing with the Typhoon OCR model
Built-in prompt templates for different document processing tasks
Process specific pages from multi-page PDF documents

Installation

pip install typhoon-ocr

System Requirements

The package requires the Poppler utilities to be installed on your system:

For macOS:

brew install poppler

For Linux:

sudo apt-get update
sudo apt-get install poppler-utils

The following binaries are required:

pdfinfo
pdftoppm
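
A quick way to confirm both binaries are reachable before running any OCR is to look them up on PATH. This is a minimal sketch using only the Python standard library; the helper name missing_poppler_binaries is hypothetical and is not part of typhoon-ocr.

```python
import shutil

# The two Poppler binaries listed as required above.
REQUIRED_BINARIES = ("pdfinfo", "pdftoppm")


def missing_poppler_binaries(binaries=REQUIRED_BINARIES):
    """Return the names from `binaries` that cannot be found on PATH."""
    return [name for name in binaries if shutil.which(name) is None]


if __name__ == "__main__":
    missing = missing_poppler_binaries()
    if missing:
        print("Missing Poppler binaries: " + ", ".join(missing))
    else:
        print("All required Poppler binaries were found.")
```

If anything is reported missing, installing Poppler with the commands above should resolve it.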

Usage

Core functionality

The package provides two main functions:

from typhoon_ocr import ocr_document, prepare_ocr_messages

ocr_document: Full OCR pipeline for the Typhoon OCR model via opentyphoon.ai or any OpenAI-compatible API (such as vLLM)
prepare_ocr_messages: Generate complete OCR-ready messages for the Typhoon OCR model

Complete OCR workflow

Use the simplified API to OCR a document in a single call through the OpenAI-compatible API at opentyphoon.ai:

from typhoon_ocr import ocr_document

markdown = ocr_document(
    pdf_or_image_path="document.pdf",  # Works with PDFs or images
    task_type="default",    # Choose between "default" or "structure"
    page_num=2              # Process page 2 of a PDF (default is 1, always 1 for images)
)

# Or with an image
markdown = ocr_document(
    pdf_or_image_path="scan.jpg",  # Works with PDFs or images
    task_type="default",    # Choose between "default" or "structure"
)
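
Since page_num selects a single page, processing a whole PDF means one call per page. The sketch below shows one possible approach: read the page count from Poppler's pdfinfo output, then loop over pages. The helpers parse_page_count, count_pdf_pages, and ocr_all_pages are hypothetical and not part of typhoon-ocr; the per-page OCR function is passed in as an argument so the loop itself does not depend on the package.

```python
import re
import subprocess


def parse_page_count(pdfinfo_output):
    """Extract the page count from the text printed by `pdfinfo`."""
    match = re.search(r"^Pages:\s+(\d+)", pdfinfo_output, flags=re.MULTILINE)
    if match is None:
        raise ValueError("no 'Pages:' line found in pdfinfo output")
    return int(match.group(1))


def count_pdf_pages(pdf_path):
    """Run Poppler's pdfinfo on the file and return its number of pages."""
    result = subprocess.run(
        ["pdfinfo", pdf_path], capture_output=True, text=True, check=True
    )
    return parse_page_count(result.stdout)


def ocr_all_pages(pdf_path, ocr_page, page_count):
    """Call ocr_page(pdf_path, page_num) for pages 1..page_count, in order."""
    return [ocr_page(pdf_path, page_num) for page_num in range(1, page_count + 1)]


# Possible usage with the package (assumes typhoon-ocr is installed and an
# API key is configured):
#
# from typhoon_ocr import ocr_document
# pages = ocr_all_pages(
#     "document.pdf",
#     lambda path, n: ocr_document(pdf_or_image_path=path, task_type="default", page_num=n),
#     count_pdf_pages("document.pdf"),
# )
```

Pages are handled sequentially here. Whether concurrent requests are appropriate depends on the rate limits of the API being used.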

Alternatively, prepare the messages manually and send them to an OpenAI-compatible API yourself:

import json

from typhoon_ocr import prepare_ocr_messages
from openai import OpenAI

# Prepare messages for OCR processing
messages = prepare_ocr_messages(
    pdf_or_image_path="document.pdf",  # Works with PDFs or images
    task_type="default",    # Choose between "default" or "structure"
    page_num=2              # Process page 2 of a PDF (default is 1, always 1 for images)
)

# Use with the https://opentyphoon.ai/ API or a self-hosted model served via vLLM
# See model list at https://huggingface.co/collections/scb10x/typhoon-ocr-682713483cb934ab0cf069bd
client = OpenAI(base_url='https://api.opentyphoon.ai/v1')
response = client.chat.completions.create(
    model="typhoon-ocr-preview",
    messages=messages,
    max_tokens=16000,
    extra_body={
        "repetition_penalty": 1.2,
        "temperature": 0.1,
        "top_p": 0.6,
    },
)

# Parse the JSON response
text_output = response.choices[0].message.content
markdown = json.loads(text_output)['natural_text']
print(markdown)
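
The parsing step above assumes the model always returns a JSON object with a natural_text field. A slightly more defensive version might fall back to the raw text when that is not the case. This is a sketch only: the helper extract_natural_text is hypothetical, and the fallback behavior is an assumption rather than something the package documents.

```python
import json


def extract_natural_text(text_output):
    """Return the 'natural_text' field when the output is a JSON object that
    contains it; otherwise return the raw text unchanged."""
    if text_output is None:
        # The message content can be empty; treat that as no text.
        return ""
    try:
        parsed = json.loads(text_output)
    except json.JSONDecodeError:
        # Not valid JSON: assume the model returned plain text.
        return text_output
    if isinstance(parsed, dict) and "natural_text" in parsed:
        return parsed["natural_text"]
    return text_output
```

With this helper, the last lines of the example could read markdown = extract_natural_text(response.choices[0].message.content).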

Available task types

The package comes with built-in prompt templates for different OCR tasks:

default: Extracts markdown representation of the document with tables in markdown format
structure: Provides more structured output with HTML tables and image analysis placeholders
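
Because task_type is passed as a plain string, a typo may only surface once the call is made. A small guard like the one below can catch it earlier. The TASK_TYPES mapping and validate_task_type helper are hypothetical and simply mirror the two task types listed above; they are not part of typhoon-ocr.

```python
# Task types as described above; the descriptions are paraphrased.
TASK_TYPES = {
    "default": "markdown output with tables in markdown format",
    "structure": "more structured output with HTML tables and image analysis placeholders",
}


def validate_task_type(task_type):
    """Return task_type unchanged if it is known, otherwise raise ValueError."""
    if task_type not in TASK_TYPES:
        valid = ", ".join(sorted(TASK_TYPES))
        raise ValueError(f"unknown task_type {task_type!r}; expected one of: {valid}")
    return task_type
```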

Document Extraction Capabilities

The Typhoon OCR model, when used with this package, can extract:

Structured text with proper layout preservation
Tables (in markdown or HTML format)
Document hierarchy (headings, paragraphs, lists)
Text with positional information
Basic image analysis and placement

License

This project's code is licensed under the Apache 2.0 License.

Acknowledgments

The code is based on work from OlmoCR under the Apache 2.0 license.
