
A package for extracting structured content from PDFs and images using Typhoon OCR models
Typhoon OCR is a model for extracting structured markdown from images or PDFs. It supports document layout analysis and table extraction, returning results in markdown or HTML. This package provides utilities to convert images and PDFs to the format supported by the Typhoon OCR model.
The Typhoon OCR model supports:
pip install typhoon-ocr
The package requires the Poppler utilities to be installed on your system:
# macOS (Homebrew)
brew install poppler
# Debian/Ubuntu
sudo apt-get update
sudo apt-get install poppler-utils
The following binaries are required:
pdfinfo
pdftoppm
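A quick way to confirm that both binaries are reachable before running the package is to look them up on `PATH`. The snippet below is a minimal sketch using only the Python standard library; the helper name `missing_poppler_binaries` is hypothetical and not part of `typhoon-ocr`.

```python
import shutil

# The two Poppler binaries listed above as required.
REQUIRED_BINARIES = ("pdfinfo", "pdftoppm")


def missing_poppler_binaries(binaries=REQUIRED_BINARIES):
    """Return the names of the given binaries that are not found on PATH."""
    return [name for name in binaries if shutil.which(name) is None]


if __name__ == "__main__":
    missing = missing_poppler_binaries()
    if missing:
        print("Missing Poppler binaries:", ", ".join(missing))
    else:
        print("All required Poppler binaries were found.")
```

If anything is reported missing, reinstalling Poppler with the commands above is the likely fix.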
The package provides 2 main functions:
from typhoon_ocr import ocr_document, prepare_ocr_messages
ocr_document
: Full OCR pipeline for the Typhoon OCR model via opentyphoon.ai or an OpenAI-compatible API (such as vLLM)
prepare_ocr_messages
: Generate complete OCR-ready messages for the Typhoon OCR model
Use the simplified API to OCR a document, or prepare messages for an OpenAI-compatible API at opentyphoon.ai:
from typhoon_ocr import ocr_document
markdown = ocr_document(
    pdf_or_image_path="document.pdf",  # Works with PDFs or images
    task_type="default",  # Choose between "default" or "structure"
    page_num=2,  # Process page 2 of a PDF (default is 1, always 1 for images)
)
# Or with an image
markdown = ocr_document(
    pdf_or_image_path="scan.jpg",  # Works with PDFs or images
    task_type="default",  # Choose between "default" or "structure"
)
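Because `ocr_document` handles one page per call, processing a whole PDF means calling it once per page. The sketch below is one possible way to do that, not part of the package: it reads the page count from the `Pages:` line printed by Poppler's `pdfinfo`, then loops from page 1. The helper names `parse_page_count` and `ocr_all_pages` are hypothetical, and the sketch assumes API access is already configured the way the package expects.

```python
import re
import subprocess


def parse_page_count(pdfinfo_output):
    """Extract the page count from the text printed by `pdfinfo`."""
    match = re.search(r"^Pages:\s+(\d+)", pdfinfo_output, flags=re.MULTILINE)
    if match is None:
        raise ValueError("no 'Pages:' line found in pdfinfo output")
    return int(match.group(1))


def ocr_all_pages(pdf_path, task_type="default"):
    """Hypothetical helper: run ocr_document on every page of a PDF."""
    # Imported here so parse_page_count can be used without the package.
    from typhoon_ocr import ocr_document

    info = subprocess.run(
        ["pdfinfo", pdf_path], capture_output=True, text=True, check=True
    ).stdout
    pages = []
    # page_num is 1-based, matching the default of 1 described above.
    for page_num in range(1, parse_page_count(info) + 1):
        pages.append(
            ocr_document(
                pdf_or_image_path=pdf_path,
                task_type=task_type,
                page_num=page_num,
            )
        )
    return pages
```

Calling `ocr_all_pages("document.pdf")` would return one markdown string per page, in page order.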
Prepare the messages manually.
import json

from openai import OpenAI
from typhoon_ocr import prepare_ocr_messages

# Prepare messages for OCR processing
messages = prepare_ocr_messages(
    pdf_or_image_path="document.pdf",  # Works with PDFs or images
    task_type="default",  # Choose between "default" or "structure"
    page_num=2,  # Process page 2 of a PDF (default is 1, always 1 for images)
)
# Use with the https://opentyphoon.ai/ API or a self-hosted model served via vLLM
# See model list at https://huggingface.co/collections/scb10x/typhoon-ocr-682713483cb934ab0cf069bd
client = OpenAI(base_url='https://api.opentyphoon.ai/v1')
response = client.chat.completions.create(
    model="typhoon-ocr-preview",
    messages=messages,
    max_tokens=16000,
    extra_body={
        "repetition_penalty": 1.2,
        "temperature": 0.1,
        "top_p": 0.6,
    },
)
# Parse the JSON response
text_output = response.choices[0].message.content
markdown = json.loads(text_output)['natural_text']
print(markdown)
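The final parsing step above assumes the model always returns valid JSON with a `natural_text` key. A more defensive variant might look like the sketch below. The helper name `extract_natural_text` is hypothetical, and falling back to the raw text is an assumption about what is useful, not documented package behavior.

```python
import json


def extract_natural_text(text_output):
    """Return the 'natural_text' field, or the raw text if it is not usable JSON."""
    try:
        payload = json.loads(text_output)
    except json.JSONDecodeError:
        # Assumption: if the model did not emit JSON, keep its raw output.
        return text_output
    if isinstance(payload, dict) and "natural_text" in payload:
        return payload["natural_text"]
    return text_output
```

With this helper, the last lines of the example could read `markdown = extract_natural_text(text_output)`.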
The package comes with built-in prompt templates for different OCR tasks:
default
: Extracts markdown representation of the document with tables in markdown format
structure
: Provides more structured output with HTML tables and image analysis placeholders
The Typhoon OCR model, when used with this package, can extract:
This project code is licensed under the Apache 2.0 License.
The code is based on work from OlmoCR under the Apache 2.0 license.