
About
Docprompt is a library for Document AI. It aims to make enterprise-level document analysis easy thanks to the zero-shot capability of large language models.
Supercharged Document Analysis
- Common utilities for interacting with PDFs
  - PDF loading and serialization
  - PDF byte compression using Ghostscript :ghost:
  - Fast rasterization :fire: :rocket:
  - Page splitting, re-export with PDFium
  - Document Search, powered by Rust :fire:
- Support for most OCR providers with batched inference
  - Google :white_check_mark:
  - Amazon Textract :white_check_mark:
  - Tesseract :white_check_mark:
  - Azure Document Intelligence :red_circle:
- Layout Aware Page Representation
  - Run Document Layout Analysis with text-only LLMs!
- Prompt Garden for common document analysis tasks zero-shot, including:
  - Markerization (Pdf2Markdown)
  - Table Extraction
  - Page Classification
  - Key-value extraction (Coming soon)
  - Segmentation (Coming soon)
Documents and large language models
Features
- Representations for common document layout types: TextBlock, BoundingBox, etc.
- Generic implementations of OCR providers
- Document Search powered by Rust and R-trees :fire:
- Table Extraction, Page Classification, PDF2Markdown
Installation
Use the package manager pip to install Docprompt.
pip install docprompt
With an OCR provider
pip install "docprompt[google]"
With search support
pip install "docprompt[search]"
Usage
Simple Operations
from docprompt import load_document
document = load_document("path/to/my.pdf")
# Rasterize a single page at 120 DPI
page_number = 5
rastered = document.rasterize_page(page_number, dpi=120)
# Split a range of pages out into a new document
document_2 = document.split(start=125, stop=130)
Converting a PDF to markdown
Converting documents into markdown is a great way to prepare them for downstream chunking or ingestion into a RAG system.
from docprompt import load_document_node
from docprompt.tasks.markerize import AnthropicMarkerizeProvider
document_node = load_document_node("path/to/my.pdf")
markerize_provider = AnthropicMarkerizeProvider()
markerized_document = markerize_provider.process_document_node(document_node)
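Once a document has been converted, the chunking step mentioned above can be sketched in plain Python. This is a minimal illustration that assumes the converted text is available as an ordinary markdown string; `chunk_markdown` and the sample text are hypothetical and not part of Docprompt. It splits on ATX headings only and does not account for `#` characters inside fenced code.

```python
import re
from typing import List, Tuple

# Matches ATX headings such as "# Title" or "### Sub-section".
HEADING = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")


def chunk_markdown(markdown: str) -> List[Tuple[str, str]]:
    """Split markdown into (heading, body) chunks at each ATX heading.

    Text before the first heading is returned under an empty heading.
    """
    chunks: List[Tuple[str, str]] = []
    heading, body = "", []
    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            # Close the previous chunk before starting a new one.
            if heading or any(part.strip() for part in body):
                chunks.append((heading, "\n".join(body).strip()))
            heading, body = match.group(2), []
        else:
            body.append(line)
    # Flush whatever is left after the last heading.
    if heading or any(part.strip() for part in body):
        chunks.append((heading, "\n".join(body).strip()))
    return chunks


# Hypothetical markdown, standing in for the text of a converted document.
sample = (
    "Preamble text.\n\n"
    "# Invoice\nTotal due: 40 USD\n\n"
    "## Line items\n- Widget\n- Gadget\n"
)
chunks = chunk_markdown(sample)
```

Each chunk keeps its heading, so it can be stored alongside the body as retrieval metadata.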
Extract tables with SOTA speed and accuracy.
from docprompt import load_document_node
from docprompt.tasks.table_extraction import AnthropicTableExtractionProvider
document_node = load_document_node("path/to/my.pdf")
table_extraction_provider = AnthropicTableExtractionProvider()
extracted_tables = table_extraction_provider.process_document_node(document_node)
Performing OCR
from docprompt import load_document, DocumentNode
from docprompt.tasks.ocr.gcp import GoogleOcrProvider
# my_project_id, my_processor_id and path_to_service_file are placeholders for your own values
provider = GoogleOcrProvider.from_service_account_file(
    project_id=my_project_id,
    processor_id=my_processor_id,
    service_account_file=path_to_service_file
)
document = load_document("path/to/my.pdf")
document_node = DocumentNode.from_document(document)
# Run OCR on the document node
provider.process_document_node(document_node)
# Inspect the OCR result stored on the first page node
document_node[0].ocr_result
Document Search
When a large language model returns a result, we might want to highlight that result for our users. However, language models return results as text, while what we need to show our users requires a page number and a bounding box.
After extracting text from a PDF, we can support this pattern using DocumentProvenanceLocator, which lives on a DocumentNode:
from docprompt import load_document, DocumentNode
from docprompt.tasks.ocr.gcp import GoogleOcrProvider
# my_project_id, my_processor_id and path_to_service_file are placeholders for your own values
provider = GoogleOcrProvider.from_service_account_file(
    project_id=my_project_id,
    processor_id=my_processor_id,
    service_account_file=path_to_service_file
)
document = load_document("path/to/my.pdf")
document_node = DocumentNode.from_document(document)
# OCR must run first so the locator has text to search
provider.process_document_node(document_node)
# Search the whole document for a phrase
document_node.locator.search("John Doe")
# Restrict the search to a single page
document_node.locator.search("Jane Doe", page_number=4)
This functionality uses a combination of rtree and the Rust library tantivy, allowing you to perform thousands of searches in seconds :fire: :rocket:
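To make the text-to-location idea concrete, here is a minimal pure-Python sketch of mapping a phrase back to a page number and a bounding box. It is not Docprompt's implementation: it uses a linear scan rather than rtree or tantivy, and `Box`, `Word`, `merge`, `locate` and the sample data are all hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Box:
    """Axis-aligned box in normalized page coordinates (0.0 to 1.0)."""
    x0: float
    top: float
    x1: float
    bottom: float


@dataclass(frozen=True)
class Word:
    """A single OCR word and where it sits on the page."""
    text: str
    box: Box


def merge(boxes: List[Box]) -> Box:
    """Return the smallest box covering every box in the list."""
    return Box(
        min(b.x0 for b in boxes),
        min(b.top for b in boxes),
        max(b.x1 for b in boxes),
        max(b.bottom for b in boxes),
    )


def locate(
    pages: Dict[int, List[Word]],
    query: str,
    page_number: Optional[int] = None,
) -> List[Tuple[int, Box]]:
    """Find runs of consecutive words equal to the query (case-insensitive)."""
    needle = query.lower().split()
    hits: List[Tuple[int, Box]] = []
    if not needle:
        return hits
    for number in sorted(pages):
        # Optionally restrict the scan to one page.
        if page_number is not None and number != page_number:
            continue
        words = pages[number]
        for start in range(len(words) - len(needle) + 1):
            window = words[start:start + len(needle)]
            if [w.text.lower() for w in window] == needle:
                hits.append((number, merge([w.box for w in window])))
    return hits


# Hypothetical OCR output: page number -> words in reading order.
pages = {
    1: [
        Word("Signed", Box(0.10, 0.20, 0.20, 0.23)),
        Word("by", Box(0.21, 0.20, 0.24, 0.23)),
        Word("John", Box(0.25, 0.20, 0.32, 0.23)),
        Word("Doe", Box(0.33, 0.20, 0.39, 0.23)),
    ],
    4: [
        Word("Jane", Box(0.50, 0.60, 0.57, 0.63)),
        Word("Doe", Box(0.58, 0.60, 0.64, 0.63)),
    ],
}

hits = locate(pages, "John Doe")
```

Each hit pairs a page number with the merged box of the matched words, which is the information a viewer needs to draw a highlight. A linear scan like this costs time proportional to the document length on every query, which is the cost that index-backed search avoids.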