PlasmaPDF is a Python library for converting text spans to x-y positioned tokens in the PAWLS format. It is a utility library used in OpenContracts and PdfRedactor.
To install PlasmaPDF, use pip:
pip install plasmapdf
PlasmaPDF's PdfDataLayer addresses a hard problem: keeping plain-text spans synchronized with their physical locations in a PDF. Let's break it down:
PdfDataLayer is designed to keep span-based annotations consistent with the underlying PDF tokens so that it's easy to convert between the two. In today's LLM-powered world, one obvious use case is converting LLM-generated span coordinates to PDF x,y coordinates for annotations and redactions. This requires using the OCR tokens as the source of truth and generating the text layer from the tokens with consistent preprocessing.
The fundamental building block of our translation layer is the PAWLS token, a format we originally adopted from Allen AI's PAWLS project. See the type definitions in plasmapdf.models.types for detailed typing.
PAWLS token example:
pawls_tokens = [
    {
        "page": {"width": 612, "height": 792, "index": 0},
        "tokens": [
            {"x": 72, "y": 72, "width": 50, "height": 12, "text": "Hello"},
            {"x": 130, "y": 72, "width": 50, "height": 12, "text": "World"}
        ]
    }
]
This represents the foundational "source of truth" - the actual positions of text on PDF pages.
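To make the "tokens as source of truth" idea concrete, here is a minimal sketch of how a text layer could be derived from PAWLS tokens while recording each token's character range. It is not PlasmaPDF's implementation: the function name `build_text_layer`, the output keys, and the single-space separator between tokens are all assumptions made for illustration.

```python
def build_text_layer(pages):
    """Join token texts into one string and record each token's character range.

    Hypothetical helper, not part of PlasmaPDF. Assumes tokens are joined
    with a single space, in the order they appear in each page.
    """
    parts = []
    index = []
    cursor = 0
    for page in pages:
        page_no = page["page"]["index"]
        for token_id, token in enumerate(page["tokens"]):
            if parts:
                # Assumed separator: one space between consecutive tokens.
                parts.append(" ")
                cursor += 1
            text = token["text"]
            index.append({
                "page": page_no,
                "token_id": token_id,
                "char_start": cursor,            # inclusive
                "char_end": cursor + len(text),  # exclusive
            })
            parts.append(text)
            cursor += len(text)
    return "".join(parts), index


pages = [
    {
        "page": {"width": 612, "height": 792, "index": 0},
        "tokens": [
            {"x": 72, "y": 72, "width": 50, "height": 12, "text": "Hello"},
            {"x": 130, "y": 72, "width": 50, "height": 12, "text": "World"},
        ],
    }
]
doc_text, token_index = build_text_layer(pages)
print(doc_text)  # Hello World
```

Because the text is generated from the tokens rather than extracted separately, any character offset in the text can be traced back to a token and therefore to a position on the page.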
We then use pandas DataFrames to create efficient indices:
import pandas as pd

# Page DataFrame tracks character ranges per page
page_df = pd.DataFrame([
    {"Page": 0, "Start": 0, "End": 500},
    {"Page": 1, "Start": 501, "End": 1000}
])
# Token DataFrame maps each token to its character position
token_df = pd.DataFrame([
    {"Page": 0, "Token_Id": 0, "Char_Start": 0, "Char_End": 5},
    {"Page": 0, "Token_Id": 1, "Char_Start": 6, "Char_End": 11}
])
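As an illustration of what these indices make possible, the sketch below looks up the tokens that overlap a character span and the page that contains a given offset. It uses plain lists of dicts instead of DataFrames so it runs without pandas. The helper names `tokens_in_span` and `page_for_offset` are hypothetical, and treating the end offsets as exclusive is an assumption.

```python
from bisect import bisect_right

# Same rows as the DataFrames above, kept as plain dicts for this sketch.
page_rows = [
    {"Page": 0, "Start": 0, "End": 500},
    {"Page": 1, "Start": 501, "End": 1000},
]
token_rows = [
    {"Page": 0, "Token_Id": 0, "Char_Start": 0, "Char_End": 5},
    {"Page": 0, "Token_Id": 1, "Char_Start": 6, "Char_End": 11},
]


def tokens_in_span(rows, start, end):
    """Return rows whose [Char_Start, Char_End) range overlaps [start, end)."""
    return [r for r in rows if r["Char_Start"] < end and r["Char_End"] > start]


def page_for_offset(rows, offset):
    """Return the page whose start is the last one at or before offset."""
    starts = [r["Start"] for r in rows]  # assumed sorted ascending
    position = bisect_right(starts, offset) - 1
    return rows[position]["Page"]


first_word = tokens_in_span(token_rows, 0, 5)
both_words = tokens_in_span(token_rows, 3, 8)
print([r["Token_Id"] for r in first_word])  # [0]
print([r["Token_Id"] for r in both_words])  # [0, 1]
print(page_for_offset(page_rows, 501))      # 1
```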
The design rests on two ideas: character-based indexing and a token-first architecture.
Use the doc_text property provided by PdfDataLayer to search for spans.
PdfDataLayer solves several thorny problems that come up in real-world document processing, with a solution that is both powerful and practical.
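Searching the token-derived text, rather than text extracted some other way, is what keeps the resulting offsets aligned with the tokens. A minimal sketch of such a search follows. The helper `find_spans` is hypothetical, and the string literal stands in for the value you would read from `pdf_data_layer.doc_text`.

```python
import re


def find_spans(doc_text, phrase):
    """Return (start, end) character offsets for every occurrence of phrase.

    Offsets follow Python slicing: start is inclusive, end is exclusive.
    """
    pattern = re.escape(phrase)  # match the phrase literally, not as a regex
    return [(m.start(), m.end()) for m in re.finditer(pattern, doc_text)]


# Stand-in for pdf_data_layer.doc_text in this self-contained example.
doc_text = "Hello World. Hello again."
spans = find_spans(doc_text, "Hello")
print(spans)  # [(0, 5), (13, 18)]
```

Each (start, end) pair can then be used to build a TextSpan with matching start, end, and text values.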
Start by importing the necessary components:
from plasmapdf.models.PdfDataLayer import build_translation_layer
from plasmapdf.models.types import TextSpan, SpanAnnotation, PawlsPagePythonType
The core of PlasmaPDF is the PdfDataLayer class. You create an instance of this class using the build_translation_layer function:
pawls_tokens: list[PawlsPagePythonType] = [
    {
        "page": {"width": 612, "height": 792, "index": 0},
        "tokens": [
            {"x": 72, "y": 72, "width": 50, "height": 12, "text": "Hello"},
            {"x": 130, "y": 72, "width": 50, "height": 12, "text": "World"}
        ]
    }
]
pdf_data_layer = build_translation_layer(pawls_tokens)
You can extract raw text from a span in the document:
span = TextSpan(id="1", start=0, end=11, text="Hello World")
raw_text = pdf_data_layer.get_raw_text_from_span(span)
print(raw_text) # Output: "Hello World"
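When spans come from an LLM or another external source, it can help to check that the claimed text really sits at the claimed offsets before building annotations. The sketch below shows one way to do that against a plain string. The helper `span_matches` is hypothetical and says nothing about how get_raw_text_from_span works internally.

```python
def span_matches(doc_text, start, end, expected_text):
    """Check that doc_text[start:end] is in bounds and equals expected_text."""
    in_bounds = 0 <= start <= end <= len(doc_text)
    return in_bounds and doc_text[start:end] == expected_text


doc_text = "Hello World"  # stand-in for pdf_data_layer.doc_text
ok = span_matches(doc_text, 0, 11, "Hello World")
off_by_one = span_matches(doc_text, 0, 10, "Hello World")
print(ok, off_by_one)  # True False
```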
To create an annotation:
span_annotation = SpanAnnotation(span=span, annotation_label="GREETING")
oc_annotation = pdf_data_layer.create_opencontract_annotation_from_span(span_annotation)
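Conceptually, turning a span into a positioned annotation means collecting the tokens the span covers and taking the rectangle that encloses them. The sketch below shows only that geometric step. The function `bounding_box` and its output keys are hypothetical and are not the OpenContracts annotation schema. It also assumes the page origin is the top-left corner with y increasing downward.

```python
def bounding_box(tokens):
    """Return the smallest rectangle that encloses every token.

    Assumes x, y mark each token's top-left corner and y grows downward.
    """
    return {
        "left": min(t["x"] for t in tokens),
        "top": min(t["y"] for t in tokens),
        "right": max(t["x"] + t["width"] for t in tokens),
        "bottom": max(t["y"] + t["height"] for t in tokens),
    }


tokens = [
    {"x": 72, "y": 72, "width": 50, "height": 12, "text": "Hello"},
    {"x": 130, "y": 72, "width": 50, "height": 12, "text": "World"},
]
box = bounding_box(tokens)
print(box)  # {'left': 72, 'top': 72, 'right': 180, 'bottom': 84}
```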
You can access various pieces of information about the document:
print(pdf_data_layer.doc_text) # Full document text
print(pdf_data_layer.human_friendly_full_text) # Human-readable version of the text
print(pdf_data_layer.page_dataframe) # DataFrame with page information
print(pdf_data_layer.tokens_dataframe) # DataFrame with token information
PlasmaPDF uses hatch for environment and development workflow management. Here's how to get started:
First, install hatch globally:
pip install hatch
Hatch automatically manages virtual environments for you. To activate the development environment:
hatch shell dev
PlasmaPDF uses pytest for testing. To run tests:
hatch run dev:pytest
For tests with coverage:
hatch run dev:pytest --cov
PlasmaPDF comes with several code quality tools configured:
To format your code using black and isort:
hatch run dev:format
To run flake8 linting:
hatch run dev:lint
To run mypy type checking:
hatch run types:check
PlasmaPDF defines several hatch environments in pyproject.toml:
dev: Main development environment with testing and formatting tools
types: Environment for type checking with mypy
Each environment has its own dependencies and scripts defined in pyproject.toml.
The project follows these standards:
PlasmaPDF can handle multi-page documents. When you create the PdfDataLayer, make sure to include tokens for all pages:
multi_page_pawls_tokens = [
    {
        "page": {"width": 612, "height": 792, "index": 0},
        "tokens": [...]
    },
    {
        "page": {"width": 612, "height": 792, "index": 1},
        "tokens": [...]
    }
]
pdf_data_layer = build_translation_layer(multi_page_pawls_tokens)
If you have a span that potentially crosses page boundaries, you can split it:
long_span = TextSpan(id="2", start=0, end=1000, text="...")
page_aware_spans = pdf_data_layer.split_span_on_pages(long_span)
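The idea behind page-aware splitting can be sketched as clipping the span to each page's character range and keeping the non-empty pieces. This is an illustration, not the body of split_span_on_pages. The helper `split_on_pages` and its output keys are hypothetical, and page End offsets are assumed to be exclusive.

```python
def split_on_pages(start, end, page_rows):
    """Clip the span [start, end) to each page range and keep non-empty pieces."""
    pieces = []
    for row in page_rows:
        piece_start = max(start, row["Start"])
        piece_end = min(end, row["End"])  # assumes End is exclusive
        if piece_start < piece_end:
            pieces.append(
                {"Page": row["Page"], "start": piece_start, "end": piece_end}
            )
    return pieces


page_rows = [
    {"Page": 0, "Start": 0, "End": 500},
    {"Page": 1, "Start": 501, "End": 1000},
]
pieces = split_on_pages(450, 520, page_rows)
for piece in pieces:
    print(piece)
```

A span that sits entirely on one page comes back as a single piece, so callers can treat both cases the same way.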
To create an annotation in the OpenContracts format:
span = TextSpan(id="3", start=0, end=21, text="Important clause here")
span_annotation = SpanAnnotation(span=span, annotation_label="IMPORTANT_CLAUSE")
oc_annotation = pdf_data_layer.create_opencontract_annotation_from_span(span_annotation)
PlasmaPDF includes utility functions for working with job results:
from plasmapdf.utils.utils import package_job_results_to_oc_generated_corpus_type
# Assume you have job_results, possible_span_labels, possible_doc_labels,
# possible_relationship_labels, and suggested_label_set
corpus = package_job_results_to_oc_generated_corpus_type(
    job_results,
    possible_span_labels,
    possible_doc_labels,
    possible_relationship_labels,
    suggested_label_set
)
This function packages job results into the OpenContracts corpus format.
PlasmaPDF comes with a suite of unit tests. You can run these tests to ensure everything is working correctly:
hatch test
This will run all the tests in the tests directory.
This quick start guide covers the basics of using PlasmaPDF. For more detailed information, refer to the tests or explore the source code. If you encounter any issues or have questions, please refer to the project's issue tracker or documentation.