
Powerful Python library to convert documents (PDF, DOCX, TXT) into structured JSON trees for legal, institutional, and NLP applications.
Convert documents into structured JSON effortlessly.
A Python library for extracting text from various document formats and structuring it hierarchically into JSON.
pip install doc23
To enable OCR:
sudo apt install tesseract-ocr
pip install pytesseract
from doc23 import extract_text
# Extract text from any supported document
text = extract_text("document.pdf", scan_or_image="auto")
print(text)
from doc23 import Doc23, Config, LevelConfig

config = Config(
    root_name="art_of_war",
    sections_field="chapters",
    description_field="description",
    levels={
        "chapter": LevelConfig(
            pattern=r"^CHAPTER\s+([IVXLCDM]+)\n(.+)$",
            name="chapter",
            title_field="title",
            description_field="description",
            sections_field="paragraphs"
        ),
        "paragraph": LevelConfig(
            pattern=r"^(\d+)\.\s+(.+)$",
            name="paragraph",
            title_field="number",
            description_field="text",
            is_leaf=True
        )
    }
)

with open("art_of_war.txt") as f:
    text = f.read()

doc = Doc23(text, config)
structure = doc.prune()
print(structure["chapters"][0]["title"])  # → I
{
  "description": "",
  "chapters": [
    {
      "type": "chapter",
      "title": "I",
      "description": "Laying Plans",
      "paragraphs": [
        {
          "type": "paragraph",
          "number": "1",
          "text": "Sun Tzu said: The art of war is of vital importance to the State."
        }
      ]
    }
  ]
}
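The pruned result is an ordinary nested dictionary, so it can be walked with plain Python. Here is a minimal sketch that flattens the sample output above into rows. The helper name `iter_paragraphs` is hypothetical and not part of doc23, and the field names simply mirror the example config.

```python
import json

# Sample output copied from the example above (not produced by running doc23 here).
SAMPLE = json.loads("""
{
  "description": "",
  "chapters": [
    {
      "type": "chapter",
      "title": "I",
      "description": "Laying Plans",
      "paragraphs": [
        {
          "type": "paragraph",
          "number": "1",
          "text": "Sun Tzu said: The art of war is of vital importance to the State."
        }
      ]
    }
  ]
}
""")


def iter_paragraphs(tree):
    """Yield (chapter title, paragraph number, text) for every paragraph.

    Hypothetical helper, not part of doc23.
    """
    for chapter in tree.get("chapters", []):
        for paragraph in chapter.get("paragraphs", []):
            yield chapter["title"], paragraph["number"], paragraph["text"]


rows = list(iter_paragraphs(SAMPLE))
for title, number, text in rows:
    print(f"Chapter {title}, paragraph {number}: {text}")
```

Using `.get(..., [])` keeps the traversal tolerant of chapters without paragraphs.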
Use Config and LevelConfig to define how your document is parsed:
Field | Purpose |
---|---|
pattern | Regex to match each level |
title_field | Field to assign the first regex group |
description_field | (Optional) Field for second group |
sections_field | (Optional) Where sublevels go |
paragraph_field | (Optional) Where text/nodes go if leaf |
is_leaf | (Optional) Forces this level to be terminal |
Fields Defined | Required Groups in Regex |
---|---|
title_field only | ≥1 |
title_field + description_field | ≥2 |
title_field + paragraph_field | ≥1 (second group optional) |
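To make the group-to-field mapping concrete, here is a minimal sketch using only the standard `re` module. It illustrates the rule in the tables above and is not doc23's actual implementation; `build_node` is a hypothetical helper.

```python
import re


def build_node(line, pattern, title_field, description_field=None):
    """Map regex groups to fields: group 1 -> title_field, group 2 -> description_field.

    Hypothetical helper illustrating the tables above; not doc23's own code.
    Returns None when the line does not match the pattern.
    """
    match = re.match(pattern, line)
    if match is None:
        return None
    node = {title_field: match.group(1)}
    if description_field is not None:
        node[description_field] = match.group(2)
    return node


node = build_node(
    "1. Sun Tzu said: The art of war is of vital importance to the State.",
    r"^(\d+)\.\s+(.+)$",
    title_field="number",
    description_field="text",
)
print(node)
```

The first capture group always feeds the title field, which is why every pattern needs at least one group.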
doc23 consists of several key components:
Doc23 (core.py)
├── Extractors (extractors/)
│   ├── PDFExtractor
│   ├── DocxExtractor
│   ├── TextExtractor
│   └── ...
├── Config (config_tree.py)
│   └── LevelConfig
└── Gardener (gardener.py)
The library validates your config when you create a Doc23 instance. If any issue is found, a ValueError is raised immediately.
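One check of this kind can be sketched with the standard library alone: counting a pattern's capture groups and comparing that with the fields that need them. The function `check_groups` below is hypothetical and only mirrors the required-groups table; doc23's real validation may differ and may cover more cases.

```python
import re


def check_groups(pattern, title_field, description_field=None):
    """Raise ValueError if the pattern has too few capture groups for its fields.

    Hypothetical check, not doc23's own code: title_field needs group 1,
    and description_field (when set) needs group 2.
    """
    required = 2 if description_field else 1
    found = re.compile(pattern).groups  # number of capturing groups
    if found < required:
        raise ValueError(
            f"pattern {pattern!r} has {found} capture group(s), "
            f"but {required} are required"
        )


# Two groups for two fields: passes silently.
check_groups(r"^ARTICLE\s+(\d+)\.\s*(.*)$", "title", "content")

# No groups at all: fails even with title_field only.
error_message = None
try:
    check_groups(r"^SECTION\s+\d+$", "title")
except ValueError as exc:
    error_message = str(exc)
    print(error_message)
```

Failing at construction time rather than during parsing means a bad pattern is reported before any document is processed.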
The library includes a comprehensive test suite covering various scenarios:
def test_gardener_initialization():
    config = Config(
        root_name="document",
        sections_field="sections",
        description_field="description",
        levels={
            "book": LevelConfig(
                pattern=r"^BOOK\s+(.+)$",
                name="book",
                title_field="title",
                description_field="description",
                sections_field="sections"
            ),
            "article": LevelConfig(
                pattern=r"^ARTICLE\s+(\d+)\.\s*(.*)$",
                name="article",
                title_field="title",
                description_field="content",
                paragraph_field="paragraphs",
                parent="book"
            )
        }
    )
    gardener = Gardener(config)
    assert gardener.leaf == "article"

def test_prune_basic_structure():
    config = Config(
        root_name="document",
        sections_field="sections",
        description_field="description",
        levels={
            "book": LevelConfig(
                pattern=r"^BOOK\s+(.+)$",
                name="book",
                title_field="title",
                description_field="description",
                sections_field="sections"
            ),
            "article": LevelConfig(
                pattern=r"^ARTICLE\s+(\d+)\.\s*(.*)$",
                name="article",
                title_field="title",
                description_field="content",
                paragraph_field="paragraphs",
                parent="book"
            )
        }
    )
    gardener = Gardener(config)
    text = """BOOK First Book
This is a description
ARTICLE 1. First article
This is article content
More content"""
    result = gardener.prune(text)
    assert result["sections"][0]["title"] == "First Book"
    assert result["sections"][0]["sections"][0]["paragraphs"] == ["This is article content", "More content"]

def test_prune_empty_document():
    config = Config(
        root_name="document",
        sections_field="sections",
        description_field="description",
        levels={}
    )
    gardener = Gardener(config)
    result = gardener.prune("")
    assert result["sections"] == []

def test_prune_with_free_text():
    config = Config(
        root_name="document",
        sections_field="sections",
        description_field="description",
        levels={
            "title": LevelConfig(
                pattern=r"^TITLE\s+(.+)$",
                name="title",
                title_field="title",
                description_field="description",
                sections_field="sections"
            )
        }
    )
    gardener = Gardener(config)
    text = """This is free text at the top level
TITLE First Title
Title description"""
    result = gardener.prune(text)
    assert result["description"] == "This is free text at the top level"

Run tests with:
python -m pytest tests/
Make sure Tesseract is installed and accessible in your PATH.
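A quick way to confirm this from Python is to look the executable up on PATH with the standard library. This is a small sketch independent of doc23; the helper name `tesseract_available` is hypothetical.

```python
import shutil


def tesseract_available(binary="tesseract"):
    """Return True if the given executable can be found on PATH."""
    return shutil.which(binary) is not None


if tesseract_available():
    print("Tesseract found at", shutil.which("tesseract"))
else:
    print("Tesseract not found; install it or add it to PATH before enabling OCR.")
```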
Different document formats may require specific libraries. Check that the dependencies for the formats you use are installed.
Test your patterns with tools like regex101.com and ensure you have the correct number of capture groups.
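Patterns can also be checked locally with the standard `re` module before they go into a config. The sketch below tries the chapter pattern from the earlier example against a short sample. The sample text is made up for illustration, and the `re.MULTILINE` flag is an assumption about how a whole document would be scanned, not a statement about doc23's internals.

```python
import re

# re.MULTILINE is assumed here so that ^ and $ match at line boundaries
# when the pattern is applied to a multi-line string.
CHAPTER = re.compile(r"^CHAPTER\s+([IVXLCDM]+)\n(.+)$", re.MULTILINE)

# Illustrative sample text, not taken from a real file.
sample = "CHAPTER I\nLaying Plans\n1. Sun Tzu said: ...\nCHAPTER II\nWaging War\n"

# .groups reports how many capture groups the pattern defines.
print("capture groups:", CHAPTER.groups)

matches = CHAPTER.findall(sample)
for numeral, title in matches:
    print(numeral, "->", title)
```

Comparing `CHAPTER.groups` with the number of fields the level defines catches a missing capture group before the config is used.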
Contributions are welcome! Please follow these steps:
Create a feature branch (git checkout -b feature/amazing-feature)
Commit your changes (git commit -m 'Add some amazing feature')
Push to the branch (git push origin feature/amazing-feature)
License: MIT
For advanced patterns, dynamic configs, exception handling and OCR examples, see:
We found that doc23 demonstrated a healthy version release cadence and project activity because the last version was released less than a year ago. It has 1 open source maintainer collaborating on the project.