spacy-layout

Use spaCy with PDFs, Word docs and other documents

  • 0.0.10

spaCy Layout: Process PDFs, Word documents and more with spaCy

This plugin integrates with Docling to bring structured processing of PDFs, Word documents and other input formats to your spaCy pipeline. It outputs clean, structured data in a text-based format and creates spaCy's familiar Doc objects that let you access labelled text spans like sections or headings, and tables with their data converted to a pandas.DataFrame.

This workflow makes it easy to apply powerful NLP techniques to your documents, including linguistic analysis, named entity recognition, text classification and more. It's also great for implementing chunking for RAG pipelines.

📖 Blog post: "From PDFs to AI-ready structured data: a deep dive" – A new modular workflow for converting PDFs and similar documents to structured data, featuring spacy-layout and Docling.


📝 Usage

⚠️ This package requires Python 3.10 or above.

pip install spacy-layout

After initializing the spaCyLayout preprocessor with an nlp object for tokenization, you can call it on a document path to convert it to structured data. The resulting Doc object includes layout spans that map into the original raw text and expose various attributes, including the content type and layout features.

import spacy
from spacy_layout import spaCyLayout

nlp = spacy.blank("en")
layout = spaCyLayout(nlp)

# Process a document and create a spaCy Doc object
doc = layout("./starcraft.pdf")

# The text-based contents of the document
print(doc.text)
# Document layout including pages and page sizes
print(doc._.layout)
# Tables in the document and their extracted data
print(doc._.tables)
# Markdown representation of the document
print(doc._.markdown)

# Layout spans for different sections
for span in doc.spans["layout"]:
    # Document section and token and character offsets into the text
    print(span.text, span.start, span.end, span.start_char, span.end_char)
    # Section type, e.g. "text", "title", "section_header" etc.
    print(span.label_)
    # Layout features of the section, including bounding box
    print(span._.layout)
    # Closest heading to the span (accuracy depends on document structure)
    print(span._.heading)

If you need to process larger volumes of documents at scale, you can use the spaCyLayout.pipe method, which takes an iterable of paths or bytes instead and yields Doc objects:

paths = ["one.pdf", "two.pdf", "three.pdf", ...]
for doc in layout.pipe(paths):
    print(doc._.layout)
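
When the documents live in a directory tree, the list of paths can be built lazily instead of by hand. This is a small sketch using only the standard library. The helper name iter_document_paths and the default suffixes are assumptions made for illustration, so adjust them to the formats you actually process.

```python
from pathlib import Path


def iter_document_paths(root, suffixes=(".pdf", ".docx")):
    """Yield files under root (recursively) whose suffix matches, in sorted order."""
    for path in sorted(Path(root).rglob("*")):
        # Compare suffixes case-insensitively so "report.PDF" is included
        if path.is_file() and path.suffix.lower() in suffixes:
            yield path
```

Since spaCyLayout.pipe accepts an iterable of paths, the generator can be passed to it directly, e.g. layout.pipe(iter_document_paths("./documents")).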

After you've processed the documents, you can serialize the structured Doc objects in spaCy's efficient binary format, so you don't have to re-run the resource-intensive conversion.

spaCy also allows you to call the nlp object on an already created Doc, so you can easily apply a pipeline of components for linguistic analysis or named entity recognition, use rule-based matching or anything else you can do with spaCy.

# Load the transformer-based English pipeline
# Installation: python -m spacy download en_core_web_trf
nlp = spacy.load("en_core_web_trf")
layout = spaCyLayout(nlp)

doc = layout("./starcraft.pdf")
# Apply the pipeline to access POS tags, dependencies, entities etc.
doc = nlp(doc)

Tables and tabular data

Tables are included in the layout spans with the label "table" and under the shortcut Doc._.tables. They expose a layout extension attribute, as well as an attribute data, which includes the tabular data converted to a pandas.DataFrame.

for table in doc._.tables:
    # Token position and bounding box
    print(table.start, table.end, table._.layout)
    # pandas.DataFrame of contents
    print(table._.data)

By default, the span text is a placeholder TABLE, but you can customize how a table is rendered by providing a display_table callback to spaCyLayout, which receives the pandas.DataFrame of the data. This allows you to include the table figures in the document text and use them later on, e.g. during information extraction with a trained named entity recognizer or text classifier.

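import pandas as pd

def display_table(df: pd.DataFrame) -> str:
    # Column labels are not always strings, so convert them before joining
    return f"Table with columns: {', '.join(str(col) for col in df.columns)}"

layout = spaCyLayout(nlp, display_table=display_table)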

🎛️ API

Data and extension attributes

layout = spaCyLayout(nlp)
doc = layout("./starcraft.pdf")
print(doc._.layout)
for span in doc.spans["layout"]:
    print(span.label_, span._.layout)

Attribute            Type                                 Description
-------------------  -----------------------------------  -----------
Doc._.layout         DocLayout                            Layout features of the document.
Doc._.pages          list[tuple[PageLayout, list[Span]]]  Pages in the document and the spans they contain.
Doc._.tables         list[Span]                           All tables in the document.
Doc._.markdown       str                                  Markdown representation of the document.
Doc.spans["layout"]  spacy.tokens.SpanGroup               The layout spans in the document.
Span.label_          str                                  The type of the extracted layout span, e.g. "text" or "section_header". See here for options.
Span.label           int                                  The integer ID of the span label.
Span.id              int                                  Running index of layout span.
Span._.layout        SpanLayout | None                    Layout features of a layout span.
Span._.heading       Span | None                          Closest heading to a span, if available.
Span._.data          pandas.DataFrame | None              The extracted data for table spans.
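
As an illustration of the Doc._.pages shape listed above, the sketch below counts span labels per page. The helper count_labels_per_page is a hypothetical name, and the demo data uses simple stand-in objects rather than real PageLayout and Span instances.

```python
from collections import Counter
from types import SimpleNamespace


def count_labels_per_page(pages):
    """Map each page number to a Counter of the span labels found on that page.

    Expects the list[tuple[PageLayout, list[Span]]] shape documented for Doc._.pages.
    """
    return {
        page_layout.page_no: Counter(span.label_ for span in spans)
        for page_layout, spans in pages
    }


# Hypothetical stand-ins that expose only the attributes used above
pages = [
    (
        SimpleNamespace(page_no=1),
        [
            SimpleNamespace(label_="title"),
            SimpleNamespace(label_="text"),
            SimpleNamespace(label_="text"),
        ],
    ),
    (SimpleNamespace(page_no=2), [SimpleNamespace(label_="table")]),
]
summary = count_labels_per_page(pages)
```

With a processed document, the call would be count_labels_per_page(doc._.pages).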

dataclass PageLayout

Attribute  Type   Description
---------  -----  -----------
page_no    int    The page number (1-indexed).
width      float  Page width in pixels.
height     float  Page height in pixels.

dataclass DocLayout

Attribute  Type              Description
---------  ----------------  -----------
pages      list[PageLayout]  The pages in the document.

dataclass SpanLayout

Attribute  Type   Description
---------  -----  -----------
x          float  Horizontal offset of the bounding box in pixels.
y          float  Vertical offset of the bounding box in pixels.
width      float  Width of the bounding box in pixels.
height     float  Height of the bounding box in pixels.
page_no    int    Number of the page the span is on.
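
Because both page sizes and bounding boxes are given in pixels, a span's position can be expressed relative to its page, which is convenient when comparing documents with different page sizes. The sketch below is an illustration only: DemoPageLayout and DemoSpanLayout are local stand-ins that mirror the fields documented above (they are not imported from spacy_layout), and relative_box is a hypothetical helper.

```python
from dataclasses import dataclass


@dataclass
class DemoPageLayout:
    # Local stand-in mirroring the documented PageLayout fields
    page_no: int
    width: float
    height: float


@dataclass
class DemoSpanLayout:
    # Local stand-in mirroring the documented SpanLayout fields
    x: float
    y: float
    width: float
    height: float
    page_no: int


def relative_box(span_layout, page_layout):
    """Return (x0, y0, x1, y1) of the bounding box as fractions of the page size."""
    x0 = span_layout.x / page_layout.width
    y0 = span_layout.y / page_layout.height
    x1 = (span_layout.x + span_layout.width) / page_layout.width
    y1 = (span_layout.y + span_layout.height) / page_layout.height
    return x0, y0, x1, y1


page = DemoPageLayout(page_no=1, width=600.0, height=800.0)
box = relative_box(DemoSpanLayout(x=60.0, y=80.0, width=300.0, height=40.0, page_no=1), page)
```

The function only reads attributes, so it should work the same way on the objects returned by Span._.layout and the entries of Doc._.layout.pages, provided the span's page_no matches the page passed in.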

class spaCyLayout

method spaCyLayout.__init__

Initialize the document processor.

nlp = spacy.blank("en")
layout = spaCyLayout(nlp)

Argument         Type                                     Description
---------------  ---------------------------------------  -----------
nlp              spacy.language.Language                  The initialized nlp object to use for tokenization.
separator        str                                      Token used to separate sections in the created Doc object. The separator won't be part of the layout span. If None, no separator will be added. Defaults to "\n\n".
attrs            dict[str, str]                           Override the custom spaCy attributes. Can include "doc_layout", "doc_pages", "doc_tables", "doc_markdown", "span_layout", "span_data", "span_heading" and "span_group".
headings         list[str]                                Labels of headings to consider for Span._.heading detection. Defaults to ["section_header", "page_header", "title"].
display_table    Callable[[pandas.DataFrame], str] | str  Function to generate the text-based representation of the table in the Doc.text or placeholder text. Defaults to "TABLE".
docling_options  dict[InputFormat, FormatOption]          Format options passed to Docling's DocumentConverter.
RETURNS          spaCyLayout                              The initialized object.

method spaCyLayout.__call__

Process a document and create a spaCy Doc object containing the text content and layout spans, available via Doc.spans["layout"] by default.

layout = spaCyLayout(nlp)
doc = layout("./starcraft.pdf")

Argument  Type                                  Description
--------  ------------------------------------  -----------
source    str | Path | bytes | DoclingDocument  Path of document to process, bytes or already created DoclingDocument.
RETURNS   Doc                                   The processed spaCy Doc object.

method spaCyLayout.pipe

Process multiple documents and create spaCy Doc objects. You should use this method if you're processing larger volumes of documents at scale.

layout = spaCyLayout(nlp)
paths = ["one.pdf", "two.pdf", "three.pdf", ...]
docs = layout.pipe(paths)

Argument  Type                          Description
--------  ----------------------------  -----------
sources   Iterable[str | Path | bytes]  Paths of documents to process or bytes.
YIELDS    Doc                           The processed spaCy Doc object.
