llm-content-extractor

A robust content extractor for LLM outputs with support for JSON, XML, HTML, and code blocks

PyPI

Version: 1.0.2

Maintainers: 1

LLM Content Extractor

A robust content extractor for LLM outputs with support for extracting and parsing JSON, XML, HTML, and code blocks from raw strings.

✨ Features

🎯 Multiple Format Support: Extract JSON, XML, HTML, and code blocks
🛡️ Fault Tolerant:
- Automatically handle Markdown code fences (```json ... ```)
- Intelligently extract content embedded in text
- Fix common LLM errors (e.g., trailing commas in JSON)
🏗️ Strategy Pattern: Easy to extend with custom extractors
📦 Simple API: Functional interface, ready to use
🧪 Well Tested: High test coverage for reliability
🔧 Type Safe: Full type annotations support

📦 Installation

Install with pip:

pip install llm-content-extractor

Install with Poetry:

poetry add llm-content-extractor

🚀 Quick Start

Basic Usage

from llm_content_extractor import extract, ContentType

# Extract JSON
json_text = '''
Here's the data you requested:
```json
{
    "name": "Alice",
    "age": 30,
    "hobbies": ["reading", "coding"],
}
```
'''

result = extract(json_text, ContentType.JSON)
print(result)  # {'name': 'Alice', 'age': 30, 'hobbies': ['reading', 'coding']}


JSON Extraction Examples

```python
from llm_content_extractor import extract, ContentType

# 1. JSON with Markdown fence
text1 = '```json\n{"status": "success"}\n```'
extract(text1, ContentType.JSON)  # {'status': 'success'}

# 2. Plain JSON
text2 = '{"status": "success"}'
extract(text2, ContentType.JSON)  # {'status': 'success'}

# 3. JSON embedded in text
text3 = 'The result is: {"status": "success"} - done!'
extract(text3, ContentType.JSON)  # {'status': 'success'}

# 4. JSON with trailing commas (common LLM error)
text4 = '{"items": [1, 2, 3,],}'
extract(text4, ContentType.JSON)  # {'items': [1, 2, 3]}

# 5. Using string content type
extract(text1, "json")  # Also works
```

XML Extraction Examples

Given text containing a fenced XML block:

A response from the LLM:
```xml
<root>
    <item id="1">First</item>
    <item id="2">Second</item>
</root>
```

You can extract it with:
```python
from llm_content_extractor import extract, ContentType

# Assuming the text above is in a variable `xml_text`
result = extract(xml_text, ContentType.XML)
print(result)  # Returns cleaned XML string
```

HTML Extraction Examples

Given text containing a fenced HTML block:

LLM says:
```html
<div class="container">
    <h1>Title</h1>
    <p>Content here</p>
</div>
```

You can extract it with:
```python
from llm_content_extractor import extract, ContentType

# Assuming the text above is in a variable `html_text`
result = extract(html_text, ContentType.HTML)
print(result)  # Returns cleaned HTML string
```

Code Block Extraction Examples

1. Extract language-specific code

Given a Python code block:

```python
def greet(name):
    return f"Hello, {name}!"

print(greet("World"))
```

Extract it by specifying the language:
```python
from llm_content_extractor import extract, ContentType

# Assuming the text above is in a variable `python_code_text`
code = extract(python_code_text, ContentType.CODE, language='python')
print(code)
# Output:
# def greet(name):
#     return f"Hello, {name}!"
#
# print(greet("World"))
```

2. Extract any code block

Given a generic code block (no language specified):

const x = 42;
console.log(x);

Extract it without specifying a language:

# Assuming the text above is in a variable `generic_code_text`
code = extract(generic_code_text, ContentType.CODE)
print(code)  # const x = 42;\nconsole.log(x);

🎨 Advanced Usage

Using Extractor Classes Directly

from llm_content_extractor import JSONExtractor, XMLExtractor

# Use extractor classes directly
json_extractor = JSONExtractor()
result = json_extractor.extract('{"key": "value"}')

xml_extractor = XMLExtractor()
result = xml_extractor.extract('<root><item>test</item></root>')

Custom Extractors

Create custom extractors by inheriting from the ContentExtractor base class:

from llm_content_extractor.base import ContentExtractor
from llm_content_extractor import extract, ContentType, register_extractor
import json

class CustomJSONExtractor(ContentExtractor):
    def extract(self, raw_text: str):
        # Custom extraction logic
        cleaned = raw_text.strip()
        # ... your logic here
        return json.loads(cleaned)

# Register custom extractor
register_extractor(ContentType.JSON, CustomJSONExtractor)

# Use the custom extractor (`text` is assumed to hold the raw LLM output)
result = extract(text, ContentType.JSON)

Using Custom Extractor Instances

from llm_content_extractor import extract, ContentType, JSONExtractor

# Create a custom configured extractor
my_extractor = JSONExtractor(strict=True)

# Pass the extractor instance directly
result = extract(raw_text, ContentType.JSON, extractor=my_extractor)

🧪 Fault Tolerance Features

LLM Content Extractor handles various common issues in LLM outputs:

1. Markdown Code Fences

# ✅ Supports various fence formats
extract('```json\n{"a": 1}\n```', ContentType.JSON)
extract('```JSON\n{"a": 1}\n```', ContentType.JSON)  # Uppercase
extract('```\n{"a": 1}\n```', ContentType.JSON)      # No language identifier
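To make the fence handling above concrete, here is a minimal standard-library sketch of how a fenced block might be unwrapped. It is an illustration only, not the library's actual implementation; `FENCE_RE` and `strip_fence` are hypothetical names.

```python
import re

# Hypothetical pattern: opening fence, optional language tag (any case),
# a newline, the body, then the closing fence.
FENCE_RE = re.compile(
    r"```[ \t]*([A-Za-z0-9_+-]*)[ \t]*\r?\n(.*?)\r?\n?```",
    re.DOTALL,
)


def strip_fence(raw_text: str) -> str:
    """Return the body of the first fenced block, or the trimmed input if none."""
    match = FENCE_RE.search(raw_text)
    if match:
        return match.group(2).strip()
    return raw_text.strip()


print(strip_fence('```JSON\n{"a": 1}\n```'))
```

Because the language tag is optional and matched case-insensitively by character class, the same pattern covers all three fence variants listed above.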

2. Embedded Content

# ✅ Extract content from surrounding text
text = '''
Here is the configuration:
{"enabled": true, "timeout": 30}
This will set the timeout to 30 seconds.
'''
extract(text, ContentType.JSON)  # Successfully extracts
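One way to find JSON embedded in prose, sketched with the standard library only, is to try `json.JSONDecoder.raw_decode` at each `{` or `[`. This is an assumption about a workable approach, not a description of how the library does it; `find_first_json` is a hypothetical helper.

```python
import json


def find_first_json(raw_text: str):
    """Return the first JSON object or array that parses inside raw_text."""
    decoder = json.JSONDecoder()
    for index, char in enumerate(raw_text):
        if char not in "{[":
            continue
        try:
            # raw_decode parses a value starting at `index` and ignores
            # whatever text follows it.
            value, _end = decoder.raw_decode(raw_text, index)
        except json.JSONDecodeError:
            continue  # not valid JSON from this position; keep scanning
        return value
    raise ValueError("no JSON object or array found")


sample = 'Here is the configuration:\n{"enabled": true, "timeout": 30}\nDone.'
print(find_first_json(sample))
```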

3. JSON Syntax Error Fixing

# ✅ Automatically fix trailing commas
extract('{"items": [1, 2,],}', ContentType.JSON)  # {'items': [1, 2]}
extract('[{"id": 1,}, {"id": 2,}]', ContentType.JSON)  # [{'id': 1}, {'id': 2}]
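A trailing-comma repair can be sketched with a single regular expression. The snippet below is a simplified illustration under stated assumptions, not the library's code: `loads_lenient` is a hypothetical name, and the regex is naive enough that it would also rewrite a `,}` sequence inside a string value, so it only runs after a strict parse has already failed.

```python
import json
import re

# A comma followed by optional whitespace and a closing brace or bracket.
TRAILING_COMMA_RE = re.compile(r",\s*([}\]])")


def loads_lenient(text: str):
    """Parse JSON, retrying once with trailing commas removed."""
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        repaired = TRAILING_COMMA_RE.sub(r"\1", text)
        return json.loads(repaired)


print(loads_lenient('{"items": [1, 2,],}'))
```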

4. Nested Structures

# ✅ Handle complex nested structures
nested = {
    "user": {
        "profile": {
            "name": "Alice",
            "contacts": ["email", "phone"]
        }
    }
}
# Fully supported

🏗️ Architecture

This project uses the Strategy Pattern:

ContentExtractor (Abstract Base Class)
    ├── JSONExtractor
    ├── XMLExtractor
    ├── HTMLExtractor
    └── CodeBlockExtractor

This design provides:

✅ Easy to add new extractor types
✅ Single responsibility for each extractor
✅ Flexible replacement and extension of extraction logic
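The general shape of a Strategy Pattern with a registry can be sketched as follows. All names here (`Strategy`, `JsonStrategy`, `TextStrategy`, `register`, `run`) are hypothetical stand-ins chosen for illustration; they are not the library's classes or functions.

```python
import json
from abc import ABC, abstractmethod
from typing import Any, Dict, Type


class Strategy(ABC):
    """Common interface: every strategy turns raw text into a value."""

    @abstractmethod
    def extract(self, raw_text: str) -> Any:
        ...


class JsonStrategy(Strategy):
    def extract(self, raw_text: str) -> Any:
        return json.loads(raw_text.strip())


class TextStrategy(Strategy):
    def extract(self, raw_text: str) -> Any:
        return raw_text.strip()


# Registry mapping a kind name to the strategy class that handles it.
_STRATEGIES: Dict[str, Type[Strategy]] = {"json": JsonStrategy, "text": TextStrategy}


def register(kind: str, strategy_cls: Type[Strategy]) -> None:
    """Add or replace the strategy used for `kind`."""
    if not (isinstance(strategy_cls, type) and issubclass(strategy_cls, Strategy)):
        raise TypeError("strategy_cls must be a Strategy subclass")
    _STRATEGIES[kind] = strategy_cls


def run(raw_text: str, kind: str) -> Any:
    """Look up the strategy for `kind` and delegate to it."""
    strategy_cls = _STRATEGIES.get(kind)
    if strategy_cls is None:
        raise ValueError(f"unsupported kind: {kind!r}")
    return strategy_cls().extract(raw_text)


print(run(' {"a": 1} ', "json"))
```

Adding a new format then means writing one subclass and registering it; the dispatch function does not change.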

📚 API Reference

`extract(raw_text, content_type, language="", extractor=None)`

Main extraction function.

Parameters:

raw_text (str): Raw string output from LLM
content_type (ContentType | str): Content type (JSON, XML, HTML, CODE)
language (str, optional): For CODE type, specify the programming language
extractor (ContentExtractor, optional): Custom extractor instance

Returns:

JSON: dict or list
XML/HTML/CODE: str

Raises:

ValueError: If valid content cannot be extracted
TypeError: If an invalid extractor is provided

`ContentType` Enum

class ContentType(Enum):
    JSON = "json"
    XML = "xml"
    HTML = "html"
    CODE = "code"
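Since each member's value is its lowercase name, a plain string such as "json" can be mapped to a member through the standard `Enum` value lookup. The sketch below repeats the enum so it runs on its own; `coerce_content_type` is a hypothetical helper, and whether the library normalizes strings this way is an assumption.

```python
from enum import Enum


class ContentType(Enum):
    JSON = "json"
    XML = "xml"
    HTML = "html"
    CODE = "code"


def coerce_content_type(value):
    """Accept a ContentType member or a string and return the member."""
    if isinstance(value, ContentType):
        return value
    # Enum lookup by value raises ValueError for unknown strings.
    return ContentType(str(value).lower())


print(coerce_content_type("json"))
```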

Extractor Options

JSONExtractor

JSONExtractor(strict=False)

strict: If True, disable auto-fixing of errors like trailing commas

XMLExtractor

XMLExtractor(validate=True, recover=True)

validate: If True and lxml is available, validate XML syntax
recover: If True, attempt to recover from malformed XML

HTMLExtractor

HTMLExtractor(validate=False, clean=False)

validate: If True, validate HTML structure
clean: If True, clean and normalize HTML

CodeBlockExtractor

CodeBlockExtractor(language="", strict=False)

language: Specific language to extract (e.g., 'python', 'javascript')
strict: If True, only extract fenced code blocks

🔧 Development

Setup

# Clone the repository
git clone https://github.com/aihes/llm-content-extractor.git
cd llm-content-extractor

# Install dependencies
poetry install

# Run tests
poetry run pytest

# Format code
poetry run black .

# Type checking
poetry run mypy llm_content_extractor

Running Tests

# Run all tests
poetry run pytest

# With coverage report
poetry run pytest --cov=llm_content_extractor --cov-report=html

# Run specific tests
poetry run pytest tests/test_json_extractor.py

📖 Publishing to PyPI

See docs/PUBLISHING.md for detailed publishing instructions.

Quick steps:

# 1. Update version
poetry version patch

# 2. Build
poetry build

# 3. Publish
poetry publish

🤝 Contributing

Contributions are welcome! Please follow these steps:

1. Fork the repository
2. Create a feature branch (git checkout -b feature/amazing-feature)
3. Commit your changes (git commit -m 'Add amazing feature')
4. Push to the branch (git push origin feature/amazing-feature)
5. Open a Pull Request

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

💡 Use Cases

LLM Content Extractor is particularly useful for:

🤖 LLM Application Development: Extract structured data from model outputs
🔄 Data Pipelines: Clean and standardize AI-generated content
🧪 Testing Tools: Validate LLM output formats
📊 Data Processing: Batch process LLM responses

❓ FAQ

Q: Why is my JSON extraction failing?

A: Ensure the text contains a valid JSON structure. This library tries multiple strategies, but it cannot recover JSON that is completely corrupted.

Q: Can I extract multiple code blocks?

A: The current version extracts the first matching code block. To extract multiple blocks, use the extract_all_blocks() method on CodeBlockExtractor or call the function multiple times.

Q: Is there support for other formats?

A: Yes! You can add support for new formats by inheriting from ContentExtractor and registering it in the system.

Q: How do I enable strict mode?

A: Use the extractor classes directly:

from llm_content_extractor import JSONExtractor

extractor = JSONExtractor(strict=True)
result = extractor.extract(text)

🌟 Advanced Features

Language Detection

from llm_content_extractor.strategies import CodeBlockExtractor

extractor = CodeBlockExtractor()
code = "def hello(): return 'world'"
language = extractor.detect_language(code)  # Returns 'python'
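How `detect_language` works internally is not documented here. As a rough idea of what keyword-based detection can look like, the following sketch checks a few regular-expression hints in order; `LANGUAGE_HINTS` and `guess_language` are hypothetical names, and the heuristics are deliberately crude (for example, a JavaScript `import` line would be misread as Python).

```python
import re

# Ordered (language, pattern) hints; the first pattern that matches wins.
LANGUAGE_HINTS = [
    ("python", re.compile(r"^\s*(?:def|class|import|from)\s+\w+", re.MULTILINE)),
    ("javascript", re.compile(r"\b(?:const|let|var|function)\s+\w+|console\.log")),
]


def guess_language(code: str) -> str:
    """Return a best-guess language name, or an empty string if unsure."""
    for name, pattern in LANGUAGE_HINTS:
        if pattern.search(code):
            return name
    return ""


print(guess_language("def hello(): return 'world'"))
```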

Extract All Code Blocks

from llm_content_extractor.strategies import CodeBlockExtractor

extractor = CodeBlockExtractor()
blocks = extractor.extract_all_blocks(multi_code_text)
for block in blocks:
    print(f"{block['language']}: {block['code']}")
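For readers curious how collecting every fenced block might work, here is a standard-library sketch that returns the same `language`/`code` dictionary shape used in the loop above. It is not the library's implementation; `BLOCK_RE` and `find_all_blocks` are hypothetical names.

```python
import re

# Opening fence with an optional language tag, the body, then the closing fence.
BLOCK_RE = re.compile(
    r"```[ \t]*([A-Za-z0-9_+#-]*)[ \t]*\r?\n(.*?)```",
    re.DOTALL,
)


def find_all_blocks(raw_text: str):
    """Return one dict per fenced block, in document order."""
    return [
        {"language": match.group(1).lower(), "code": match.group(2).strip()}
        for match in BLOCK_RE.finditer(raw_text)
    ]


multi_code_text = (
    "First:\n```python\nprint('hi')\n```\n"
    "Then:\n```js\nconsole.log(1);\n```\n"
)
for block in find_all_blocks(multi_code_text):
    print(f"{block['language']}: {block['code']}")
```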

Validate XML/HTML

from llm_content_extractor.strategies import XMLExtractor, HTMLExtractor

xml_extractor = XMLExtractor()
is_valid = xml_extractor.is_valid_xml(xml_string)

html_extractor = HTMLExtractor()
is_valid = html_extractor.is_valid_html(html_string)
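A well-formedness check for XML can be approximated with the standard library alone. This sketch uses `xml.etree.ElementTree` rather than lxml and checks only that the document parses; `is_well_formed_xml` is a hypothetical helper, not the library's `is_valid_xml`.

```python
import xml.etree.ElementTree as ET


def is_well_formed_xml(text: str) -> bool:
    """Return True if the text parses as a single XML document."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True


print(is_well_formed_xml("<root><item>test</item></root>"))
```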

📚 Documentation

Architecture - Detailed architecture documentation
Publishing Guide - How to publish to PyPI
Examples - Usage examples

🙏 Acknowledgments

Thanks to all contributors and developers using this project!

📬 Contact

Report Issues: GitHub Issues
Feature Requests: GitHub Discussions

If this project helps you, please consider giving it a ⭐️!
