
A robust content extractor for LLM outputs with support for extracting and parsing JSON, XML, HTML, and code blocks from raw strings.
## ✨ Features

- 🎯 Multiple Format Support: Extract JSON, XML, HTML, and code blocks
- 🛡️ Fault Tolerant:
  - Automatically handle Markdown code fences (```` ```json ... ``` ````)
  - Intelligently extract content embedded in text
  - Fix common LLM errors (e.g., trailing commas in JSON)
- 🏗️ Strategy Pattern: Easy to extend with custom extractors
- 📦 Simple API: Functional interface, ready to use
- 🧪 Well Tested: High test coverage for reliability
- 🔧 Type Safe: Full type annotation support
## 📦 Installation

Install with pip:

```bash
pip install llm-content-extractor
```

Install with Poetry:

```bash
poetry add llm-content-extractor
```
## 🚀 Quick Start

### Basic Usage

````python
from llm_content_extractor import extract, ContentType

# LLM output containing a fenced JSON block with a trailing comma
json_text = '''
Here's the data you requested:
```json
{
    "name": "Alice",
    "age": 30,
    "hobbies": ["reading", "coding"],
}
```
'''

result = extract(json_text, ContentType.JSON)
print(result)  # {'name': 'Alice', 'age': 30, 'hobbies': ['reading', 'coding']}
````
### JSON Extraction Examples
```python
from llm_content_extractor import extract, ContentType

# 1. JSON with Markdown fence
text1 = '```json\n{"status": "success"}\n```'
extract(text1, ContentType.JSON)  # {'status': 'success'}

# 2. Plain JSON
text2 = '{"status": "success"}'
extract(text2, ContentType.JSON)  # {'status': 'success'}

# 3. JSON embedded in text
text3 = 'The result is: {"status": "success"} - done!'
extract(text3, ContentType.JSON)  # {'status': 'success'}

# 4. JSON with trailing commas (common LLM error)
text4 = '{"items": [1, 2, 3,],}'
extract(text4, ContentType.JSON)  # {'items': [1, 2, 3]}

# 5. Using string content type
extract(text1, "json")  # Also works
```
### XML Extraction Examples

Given text containing a fenced XML block:

````text
A response from the LLM:
```xml
<root>
    <item id="1">First</item>
    <item id="2">Second</item>
</root>
```
````

You can extract it with:

```python
from llm_content_extractor import extract, ContentType

# Assuming the text above is in a variable `xml_text`
result = extract(xml_text, ContentType.XML)
print(result)  # Returns cleaned XML string
```
### HTML Extraction Examples

Given text containing a fenced HTML block:

````text
LLM says:
```html
<div class="container">
    <h1>Title</h1>
    <p>Content here</p>
</div>
```
````

You can extract it with:

```python
from llm_content_extractor import extract, ContentType

# Assuming the text above is in a variable `html_text`
result = extract(html_text, ContentType.HTML)
print(result)  # Returns cleaned HTML string
```
### Code Block Extraction Examples

#### 1. Extract language-specific code

Given a Python code block:

```python
def greet(name):
    return f"Hello, {name}!"

print(greet("World"))
```

Extract it by specifying the language:

```python
from llm_content_extractor import extract, ContentType

# Assuming the text above is in a variable `python_code_text`
code = extract(python_code_text, ContentType.CODE, language='python')
print(code)
# Output:
# def greet(name):
#     return f"Hello, {name}!"
#
# print(greet("World"))
```

#### 2. Extract any code block

Given a generic code block (no language specified):

```
const x = 42;
console.log(x);
```

Extract it without specifying a language:

```python
# Assuming the text above is in a variable `generic_code_text`
code = extract(generic_code_text, ContentType.CODE)
print(code)
```
## 🎨 Advanced Usage

Use the extractor classes directly:

```python
from llm_content_extractor import JSONExtractor, XMLExtractor

json_extractor = JSONExtractor()
result = json_extractor.extract('{"key": "value"}')

xml_extractor = XMLExtractor()
result = xml_extractor.extract('<root><item>test</item></root>')
```

Create custom extractors by inheriting from the `ContentExtractor` base class:

```python
import json

from llm_content_extractor import extract, ContentType, register_extractor
from llm_content_extractor.base import ContentExtractor


class CustomJSONExtractor(ContentExtractor):
    def extract(self, raw_text: str):
        # Parse the stripped text as-is, with no repair step
        cleaned = raw_text.strip()
        return json.loads(cleaned)


# Register the custom class for the JSON content type
register_extractor(ContentType.JSON, CustomJSONExtractor)

# `text` is the raw LLM output to parse
result = extract(text, ContentType.JSON)
```

Pass a configured extractor instance through the `extractor` parameter:

```python
from llm_content_extractor import extract, ContentType, JSONExtractor

my_extractor = JSONExtractor(strict=True)
# `raw_text` is the raw LLM output to parse
result = extract(raw_text, ContentType.JSON, extractor=my_extractor)
```
## 🧪 Fault Tolerance Features

LLM Content Extractor handles various common issues in LLM outputs:

### 1. Markdown Code Fences

```python
extract('```json\n{"a": 1}\n```', ContentType.JSON)  # lowercase language tag
extract('```JSON\n{"a": 1}\n```', ContentType.JSON)  # uppercase language tag
extract('```\n{"a": 1}\n```', ContentType.JSON)      # no language tag
```
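To illustrate the idea behind fence handling, here is a minimal standard-library sketch. It is not the library's implementation: `strip_fence` and its regular expression are hypothetical names made up for this example, and the sketch assumes an untagged fence should be accepted for any requested language.

```python
import re

# Hypothetical helper, not part of llm_content_extractor's API.
# Matches an opening fence with an optional language tag, the body, and a closing fence.
_FENCE_RE = re.compile(
    r"```[ \t]*([A-Za-z0-9_+-]*)[ \t]*\r?\n(.*?)\r?\n?```",
    re.DOTALL,
)


def strip_fence(text: str, language: str = "") -> str:
    """Return the body of the first acceptable fenced block, or the stripped text."""
    for match in _FENCE_RE.finditer(text):
        tag, body = match.group(1), match.group(2)
        # Accept any fence when no language is requested, accept untagged
        # fences, and otherwise compare the tag case-insensitively.
        if not language or not tag or tag.lower() == language.lower():
            return body.strip()
    return text.strip()
```

Comparing the tag case-insensitively is what lets `json` and `JSON` fences be treated alike in this sketch.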
### 2. Embedded Content

```python
text = '''
Here is the configuration:
{"enabled": true, "timeout": 30}
This will set the timeout to 30 seconds.
'''
extract(text, ContentType.JSON)
```
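One way to locate JSON embedded in prose, using only the standard library, is to try `json.JSONDecoder.raw_decode` at each opening brace or bracket. This is a sketch of the general technique, not necessarily how the library does it; `find_embedded_json` is a hypothetical name.

```python
import json
from typing import Any


def find_embedded_json(text: str) -> Any:
    """Return the first JSON object or array that decodes from within `text`."""
    decoder = json.JSONDecoder()
    for index, char in enumerate(text):
        if char not in "{[":
            continue
        try:
            # raw_decode parses one JSON value starting at `index` and
            # ignores whatever text follows it.
            value, _end = decoder.raw_decode(text, index)
        except json.JSONDecodeError:
            continue  # Not valid JSON from here; try the next opening bracket.
        return value
    raise ValueError("no JSON object or array found")
```

A known limitation of this approach: any bracketed fragment that happens to be valid JSON, such as a citation marker like `[1]`, is returned if it appears before the intended payload.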
### 3. JSON Syntax Error Fixing

```python
extract('{"items": [1, 2,],}', ContentType.JSON)
extract('[{"id": 1,}, {"id": 2,}]', ContentType.JSON)
```
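A trailing-comma repair can be sketched with a single regular expression applied only after a normal parse fails. The helper below is hypothetical (`loads_lenient` is not a library function); its `strict` flag mirrors the documented idea of disabling auto-fixing, but the library's actual repair logic may differ.

```python
import json
import re
from typing import Any

# A comma followed only by whitespace and then a closing brace or bracket.
_TRAILING_COMMA_RE = re.compile(r",\s*(?=[}\]])")


def loads_lenient(text: str, strict: bool = False) -> Any:
    """Parse JSON, retrying once with trailing commas removed unless `strict`."""
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        if strict:
            raise
        repaired = _TRAILING_COMMA_RE.sub("", text)
        return json.loads(repaired)
```

Because the regex is not string-aware, it would also rewrite a string value that contains `,}` or `,]`. Running it only as a fallback keeps already-valid input untouched.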
### 4. Nested Structures

```python
nested = {
    "user": {
        "profile": {
            "name": "Alice",
            "contacts": ["email", "phone"]
        }
    }
}
```
## 🏗️ Architecture

This project uses the Strategy Pattern:

```text
ContentExtractor (Abstract Base Class)
├── JSONExtractor
├── XMLExtractor
├── HTMLExtractor
└── CodeBlockExtractor
```

This design provides:

- ✅ Easy to add new extractor types
- ✅ Single responsibility for each extractor
- ✅ Flexible replacement and extension of extraction logic
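For readers unfamiliar with the Strategy Pattern, the following self-contained sketch shows its general shape: an abstract interface, interchangeable implementations, and a registry that dispatches to them. All names here (`Strategy`, `JsonStrategy`, `PlainTextStrategy`, `REGISTRY`, `run`) are hypothetical and stand in for the library's own classes; this is not the project's source code.

```python
import json
from abc import ABC, abstractmethod
from typing import Any, Dict


class Strategy(ABC):
    """Common interface that every extraction strategy implements."""

    @abstractmethod
    def extract(self, raw_text: str) -> Any:
        """Turn raw text into extracted content."""


class JsonStrategy(Strategy):
    def extract(self, raw_text: str) -> Any:
        return json.loads(raw_text.strip())


class PlainTextStrategy(Strategy):
    def extract(self, raw_text: str) -> str:
        return raw_text.strip()


# Hypothetical registry: maps a content-type key to the strategy that handles it.
REGISTRY: Dict[str, Strategy] = {
    "json": JsonStrategy(),
    "text": PlainTextStrategy(),
}


def run(raw_text: str, kind: str) -> Any:
    """Dispatch to whichever strategy is registered for `kind`."""
    try:
        strategy = REGISTRY[kind]
    except KeyError:
        raise ValueError(f"no strategy registered for {kind!r}") from None
    return strategy.extract(raw_text)
```

Adding a format in this sketch means writing one new subclass and one registry entry; the dispatch function does not change.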
## 📚 API Reference

### `extract()`

Main extraction function.

Parameters:

- `raw_text` (str): Raw string output from the LLM
- `content_type` (ContentType | str): Content type (JSON, XML, HTML, CODE)
- `language` (str, optional): For the CODE type, the programming language to extract
- `extractor` (ContentExtractor, optional): Custom extractor instance

Returns:

- JSON: `dict` or `list`
- XML/HTML/CODE: `str`

Raises:

- `ValueError`: If valid content cannot be extracted
- `TypeError`: If an invalid extractor is provided
### ContentType Enum

```python
from enum import Enum


class ContentType(Enum):
    JSON = "json"
    XML = "xml"
    HTML = "html"
    CODE = "code"
```
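Since `content_type` may be given either as a `ContentType` member or as a string, a function accepting both needs a normalization step. The sketch below shows one way to do that with the standard `Enum` value lookup. `normalize_content_type` is a hypothetical helper, and the case-insensitive, whitespace-tolerant matching is an assumption of this example rather than documented library behavior.

```python
from enum import Enum
from typing import Union


class ContentType(Enum):
    JSON = "json"
    XML = "xml"
    HTML = "html"
    CODE = "code"


def normalize_content_type(value: Union[ContentType, str]) -> ContentType:
    """Accept an enum member or its string value and return the enum member."""
    if isinstance(value, ContentType):
        return value
    try:
        # Enum lookup by value: ContentType("json") is ContentType.JSON.
        return ContentType(value.strip().lower())
    except ValueError:
        valid = ", ".join(member.value for member in ContentType)
        raise ValueError(
            f"unknown content type {value!r}; expected one of: {valid}"
        ) from None
```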
### `JSONExtractor(strict=False)`

- `strict`: If True, disable auto-fixing of errors like trailing commas

### `XMLExtractor(validate=True, recover=True)`

- `validate`: If True and lxml is available, validate XML syntax
- `recover`: If True, attempt to recover from malformed XML

### `HTMLExtractor(validate=False, clean=False)`

- `validate`: If True, validate HTML structure
- `clean`: If True, clean and normalize HTML

### `CodeBlockExtractor(language="", strict=False)`

- `language`: Specific language to extract (e.g., 'python', 'javascript')
- `strict`: If True, only extract fenced code blocks
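## 🔧 Development

### Setup

```bash
git clone https://github.com/aihes/llm-content-extractor.git
cd llm-content-extractor
poetry install

# Run the test suite
poetry run pytest

# Format the code
poetry run black .

# Type-check the package
poetry run mypy llm_content_extractor
```

### Running Tests

```bash
# Run all tests
poetry run pytest

# Run with an HTML coverage report
poetry run pytest --cov=llm_content_extractor --cov-report=html

# Run a single test file
poetry run pytest tests/test_json_extractor.py
```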
## 📖 Publishing to PyPI

See `docs/PUBLISHING.md` for detailed publishing instructions.

Quick steps:

```bash
poetry version patch  # bump the patch version
poetry build          # build the distribution packages
poetry publish        # upload to PyPI
```
## 🤝 Contributing

Contributions are welcome! Please follow these steps:

1. Fork the repository
2. Create a feature branch (`git checkout -b feature/amazing-feature`)
3. Commit your changes (`git commit -m 'Add amazing feature'`)
4. Push to the branch (`git push origin feature/amazing-feature`)
5. Open a Pull Request
## 📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

## 💡 Use Cases

LLM Content Extractor is particularly useful for:

- 🤖 LLM Application Development: Extract structured data from model outputs
- 🔄 Data Pipelines: Clean and standardize AI-generated content
- 🧪 Testing Tools: Validate LLM output formats
- 📊 Data Processing: Batch process LLM responses
## ❓ FAQ

Q: Why is my JSON extraction failing?

A: Ensure the text contains a valid JSON structure. The library tries multiple strategies, but it cannot recover completely corrupted JSON.

Q: Can I extract multiple code blocks?

A: The current version extracts the first matching code block. To extract multiple blocks, use the `extract_all_blocks()` method on `CodeBlockExtractor` or call the function multiple times.

Q: Is there support for other formats?

A: Yes! You can add support for new formats by inheriting from `ContentExtractor` and registering the new class with `register_extractor`.

Q: How do I enable strict mode?

A: Use the extractor classes directly:

```python
from llm_content_extractor import JSONExtractor

extractor = JSONExtractor(strict=True)
# `text` is the raw LLM output to parse
result = extractor.extract(text)
```
## 🌟 Advanced Features

### Language Detection

```python
from llm_content_extractor.strategies import CodeBlockExtractor

extractor = CodeBlockExtractor()
code = "def hello(): return 'world'"
language = extractor.detect_language(code)
```
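### Extract All Code Blocks

```python
from llm_content_extractor.strategies import CodeBlockExtractor

extractor = CodeBlockExtractor()
# Assuming `multi_code_text` holds text with several fenced code blocks
blocks = extractor.extract_all_blocks(multi_code_text)
for block in blocks:
    print(f"{block['language']}: {block['code']}")
```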
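To make the shape of such a result concrete, here is a standard-library sketch that collects every fenced block into a list of dictionaries with `language` and `code` keys, the same keys the loop above reads. `find_all_blocks` is a hypothetical function written for this example; the library's `extract_all_blocks()` may behave differently in edge cases.

```python
import re
from typing import Dict, List

# Opening fence with optional language tag, then the body up to the next fence.
_BLOCK_RE = re.compile(
    r"```[ \t]*([A-Za-z0-9_+-]*)[ \t]*\r?\n(.*?)```",
    re.DOTALL,
)


def find_all_blocks(text: str) -> List[Dict[str, str]]:
    """Return every fenced code block as {'language': ..., 'code': ...}."""
    return [
        {
            "language": match.group(1).lower(),  # empty string when untagged
            "code": match.group(2).rstrip(),     # drop the newline before the fence
        }
        for match in _BLOCK_RE.finditer(text)
    ]
```

Only trailing whitespace is stripped from each body, so indentation on the first line of a block is preserved.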
### Validate XML/HTML

```python
from llm_content_extractor.strategies import XMLExtractor, HTMLExtractor

xml_extractor = XMLExtractor()
is_valid = xml_extractor.is_valid_xml(xml_string)

html_extractor = HTMLExtractor()
is_valid = html_extractor.is_valid_html(html_string)
```
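As a point of comparison, a basic well-formedness check for XML needs nothing beyond the standard library. The sketch below uses `xml.etree.ElementTree`; `is_well_formed_xml` is a hypothetical name, and the check is likely narrower than whatever `is_valid_xml` performs (the API reference mentions lxml-based validation and recovery).

```python
import xml.etree.ElementTree as ET


def is_well_formed_xml(xml_string: str) -> bool:
    """Return True if `xml_string` parses as a single well-formed XML document."""
    try:
        ET.fromstring(xml_string)
    except ET.ParseError:
        return False
    return True
```

This only checks syntax: it does not validate against a schema, and it rejects input with mismatched tags or more than one root element.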
## 🙏 Acknowledgments

Thanks to all contributors and developers using this project!

## 📬 Contact

If this project helps you, please consider giving it a ⭐️!