
llm-content-extractor
A robust content extractor for LLM outputs with support for extracting and parsing JSON, XML, HTML, and code blocks from raw strings.
Install with pip:

```bash
pip install llm-content-extractor
```

Install with Poetry:

```bash
poetry add llm-content-extractor
```
````python
from llm_content_extractor import extract, ContentType

# Extract JSON
json_text = '''
Here's the data you requested:
```json
{
    "name": "Alice",
    "age": 30,
    "hobbies": ["reading", "coding"],
}
```
'''

result = extract(json_text, ContentType.JSON)
print(result)  # {'name': 'Alice', 'age': 30, 'hobbies': ['reading', 'coding']}
````
### JSON Extraction Examples
```python
from llm_content_extractor import extract, ContentType
# 1. JSON with Markdown fence
text1 = '```json\n{"status": "success"}\n```'
extract(text1, ContentType.JSON) # {'status': 'success'}
# 2. Plain JSON
text2 = '{"status": "success"}'
extract(text2, ContentType.JSON) # {'status': 'success'}
# 3. JSON embedded in text
text3 = 'The result is: {"status": "success"} - done!'
extract(text3, ContentType.JSON) # {'status': 'success'}
# 4. JSON with trailing commas (common LLM error)
text4 = '{"items": [1, 2, 3,],}'
extract(text4, ContentType.JSON) # {'items': [1, 2, 3]}
# 5. Using string content type
extract(text1, "json")  # Also works
```
Given text containing a fenced XML block:

````text
A response from the LLM:
```xml
<root>
    <item id="1">First</item>
    <item id="2">Second</item>
</root>
```
````

You can extract it with:

```python
from llm_content_extractor import extract, ContentType

# Assuming the text above is in a variable `xml_text`
result = extract(xml_text, ContentType.XML)
print(result)  # Returns cleaned XML string
```
Given text containing a fenced HTML block:

````text
LLM says:
```html
<div class="container">
    <h1>Title</h1>
    <p>Content here</p>
</div>
```
````

You can extract it with:

```python
from llm_content_extractor import extract, ContentType

# Assuming the text above is in a variable `html_text`
result = extract(html_text, ContentType.HTML)
print(result)  # Returns cleaned HTML string
```
1. Extract language-specific code

Given a Python code block:

```python
def greet(name):
    return f"Hello, {name}!"

print(greet("World"))
```

Extract it by specifying the language:

```python
from llm_content_extractor import extract, ContentType

# Assuming the text above is in a variable `python_code_text`
code = extract(python_code_text, ContentType.CODE, language='python')
print(code)
# Output:
# def greet(name):
#     return f"Hello, {name}!"
#
# print(greet("World"))
```
2. Extract any code block

Given a generic code block (no language specified):

```
const x = 42;
console.log(x);
```

Extract it without specifying a language:

```python
# Assuming the text above is in a variable `generic_code_text`
code = extract(generic_code_text, ContentType.CODE)
print(code)  # const x = 42;\nconsole.log(x);
```
```python
from llm_content_extractor import JSONExtractor, XMLExtractor

# Use extractor classes directly
json_extractor = JSONExtractor()
result = json_extractor.extract('{"key": "value"}')

xml_extractor = XMLExtractor()
result = xml_extractor.extract('<root><item>test</item></root>')
```
Create custom extractors by inheriting from the ContentExtractor base class:

```python
import json

from llm_content_extractor import extract, ContentType, register_extractor
from llm_content_extractor.base import ContentExtractor


class CustomJSONExtractor(ContentExtractor):
    def extract(self, raw_text: str):
        # Custom extraction logic
        cleaned = raw_text.strip()
        # ... your logic here
        return json.loads(cleaned)


# Register custom extractor
register_extractor(ContentType.JSON, CustomJSONExtractor)

# Use the custom extractor
text = '  {"key": "value"}  '  # sample input for illustration
result = extract(text, ContentType.JSON)
```
```python
from llm_content_extractor import extract, ContentType, JSONExtractor

# Create a custom configured extractor
my_extractor = JSONExtractor(strict=True)

# Pass the extractor instance directly
raw_text = '{"key": "value"}'  # sample input for illustration
result = extract(raw_text, ContentType.JSON, extractor=my_extractor)
```
LLM Content Extractor handles various common issues in LLM outputs:

```python
from llm_content_extractor import extract, ContentType

# ✅ Supports various fence formats
extract('```json\n{"a": 1}\n```', ContentType.JSON)
extract('```JSON\n{"a": 1}\n```', ContentType.JSON)  # Uppercase
extract('```\n{"a": 1}\n```', ContentType.JSON)  # No language identifier

# ✅ Extract content from surrounding text
text = '''
Here is the configuration:
{"enabled": true, "timeout": 30}
This will set the timeout to 30 seconds.
'''
extract(text, ContentType.JSON)  # Successfully extracts

# ✅ Automatically fix trailing commas
extract('{"items": [1, 2,],}', ContentType.JSON)  # {'items': [1, 2]}
extract('[{"id": 1,}, {"id": 2,}]', ContentType.JSON)  # [{'id': 1}, {'id': 2}]

# ✅ Handle complex nested structures
nested = {
    "user": {
        "profile": {
            "name": "Alice",
            "contacts": ["email", "phone"]
        }
    }
}
# Fully supported
```
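The trailing-comma repair shown above could, in principle, be done as a small pre-processing pass before `json.loads`. The sketch below is an assumption about the technique, not the library's actual implementation, and `strip_trailing_commas` is a hypothetical helper name. It walks the text once, tracks whether it is inside a string literal, and drops any comma whose next non-whitespace character is `]` or `}`.

```python
import json


def strip_trailing_commas(text: str) -> str:
    """Remove commas that directly precede a closing ']' or '}'.

    Hypothetical helper: commas inside string literals are left untouched.
    """
    out = []
    in_string = False
    escaped = False
    for i, ch in enumerate(text):
        if in_string:
            out.append(ch)
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
            continue
        if ch == '"':
            in_string = True
            out.append(ch)
            continue
        if ch == ",":
            # Look ahead past whitespace to see what follows the comma.
            rest = text[i + 1:].lstrip()
            if rest[:1] in ("]", "}"):
                continue  # drop the trailing comma
        out.append(ch)
    return "".join(out)


repaired = strip_trailing_commas('{"items": [1, 2, 3,], "note": "a,]",}')
print(json.loads(repaired))  # {'items': [1, 2, 3], 'note': 'a,]'}
```

Tracking string state matters here: a plain regex substitution would also rewrite a `,]` that appears inside a string value.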
This project uses the Strategy Pattern:
```
ContentExtractor (Abstract Base Class)
├── JSONExtractor
├── XMLExtractor
├── HTMLExtractor
└── CodeBlockExtractor
```
This design provides:
### extract(raw_text, content_type, language="", extractor=None)

Main extraction function.

Parameters:

- raw_text (str): Raw string output from LLM
- content_type (ContentType | str): Content type (JSON, XML, HTML, CODE)
- language (str, optional): For CODE type, specify the programming language
- extractor (ContentExtractor, optional): Custom extractor instance

Returns:

- dict or list
- str

Raises:

- ValueError: If valid content cannot be extracted
- TypeError: If an invalid extractor is provided

### ContentType Enum

```python
class ContentType(Enum):
    JSON = "json"
    XML = "xml"
    HTML = "html"
    CODE = "code"
```

### JSONExtractor(strict=False)

- strict: If True, disable auto-fixing of errors like trailing commas

### XMLExtractor(validate=True, recover=True)

- validate: If True and lxml is available, validate XML syntax
- recover: If True, attempt to recover from malformed XML

### HTMLExtractor(validate=False, clean=False)

- validate: If True, validate HTML structure
- clean: If True, clean and normalize HTML

### CodeBlockExtractor(language="", strict=False)
- language: Specific language to extract (e.g., 'python', 'javascript')
- strict: If True, only extract fenced code blocks

```bash
# Clone the repository
git clone https://github.com/aihes/llm-content-extractor.git
cd llm-content-extractor

# Install dependencies
poetry install

# Run tests
poetry run pytest

# Format code
poetry run black .

# Type checking
poetry run mypy llm_content_extractor
```
```bash
# Run all tests
poetry run pytest

# With coverage report
poetry run pytest --cov=llm_content_extractor --cov-report=html

# Run specific tests
poetry run pytest tests/test_json_extractor.py
```
See docs/PUBLISHING.md for detailed publishing instructions.
Quick steps:
```bash
# 1. Update version
poetry version patch

# 2. Build
poetry build

# 3. Publish
poetry publish
```
Contributions are welcome! Please follow these steps:

1. Create a feature branch (`git checkout -b feature/amazing-feature`)
2. Commit your changes (`git commit -m 'Add amazing feature'`)
3. Push the branch (`git push origin feature/amazing-feature`)

This project is licensed under the MIT License - see the LICENSE file for details.
LLM Content Extractor is particularly useful for:
Q: Why is my JSON extraction failing?
A: Ensure the text contains valid JSON structure. This library tries multiple strategies, but cannot recover completely corrupted JSON.
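The "multiple strategies" mentioned in this answer are not spelled out here. As a rough illustration only, a fallback chain could try the whole string first, then a fenced block, then the first decodable object or array embedded in the text. `extract_json_sketch` is a hypothetical name, and this is a sketch of the general idea, not the library's code.

```python
import json
import re

# Fence with an optional language tag, e.g. ```json or ```JSON or bare ```.
FENCE_RE = re.compile(r"```[A-Za-z]*[ \t]*\n(.*?)```", re.DOTALL)


def extract_json_sketch(raw_text: str):
    """Try progressively looser strategies; raise ValueError if all fail."""
    # Strategy 1: the whole string is already valid JSON.
    try:
        return json.loads(raw_text)
    except json.JSONDecodeError:
        pass

    # Strategy 2: JSON inside a Markdown fence.
    for match in FENCE_RE.finditer(raw_text):
        try:
            return json.loads(match.group(1))
        except json.JSONDecodeError:
            continue

    # Strategy 3: first '{' or '[' from which a JSON value can be decoded.
    decoder = json.JSONDecoder()
    for index, char in enumerate(raw_text):
        if char in "{[":
            try:
                value, _end = decoder.raw_decode(raw_text, index)
                return value
            except json.JSONDecodeError:
                continue

    raise ValueError("no valid JSON found")


print(extract_json_sketch('The result is: {"status": "success"} - done!'))
```

When every strategy fails, as with text that contains no braces or brackets at all, the sketch raises `ValueError`, which matches the error type the `extract` reference lists for unextractable content.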
Q: Can I extract multiple code blocks?
A: The current version extracts the first matching code block. To extract multiple blocks, use the extract_all_blocks() method on CodeBlockExtractor or call the function multiple times.
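For readers curious what collecting every fenced block involves, here is a minimal standard-library sketch. It is not the library's `extract_all_blocks()` implementation; `find_fenced_blocks` is a hypothetical function, and the `language`/`code` dictionary keys are borrowed from the usage shown elsewhere in this README.

```python
import re

# Opening fence with optional language tag, lazy body, closing fence on its own line.
BLOCK_RE = re.compile(
    r"^```([\w+-]*)[ \t]*\n(.*?)^```[ \t]*$",
    re.DOTALL | re.MULTILINE,
)


def find_fenced_blocks(text: str) -> list[dict]:
    """Return every fenced block as {'language': ..., 'code': ...}."""
    return [
        {"language": m.group(1).lower(), "code": m.group(2).rstrip("\n")}
        for m in BLOCK_RE.finditer(text)
    ]


sample = "\n".join([
    "First:",
    "```python",
    "print('hi')",
    "```",
    "Second:",
    "```",
    "const x = 42;",
    "```",
])

for block in find_fenced_blocks(sample):
    print(block)
```

Blocks without a language tag come back with an empty `language` string, so callers can decide whether to skip them or guess the language.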
Q: Is there support for other formats?
A: Yes! You can add support for new formats by inheriting from ContentExtractor and registering it in the system.
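To make the inherit-and-register idea concrete without depending on the package, the following self-contained sketch mimics the pattern with stand-in names. `Extractor`, `REGISTRY`, `register`, and `run` are all hypothetical and are not the library's `ContentExtractor` or `register_extractor`. The example adds a CSV format.

```python
import csv
import io
from abc import ABC, abstractmethod


class Extractor(ABC):
    """Stand-in for a strategy base class (hypothetical, not the library's)."""

    @abstractmethod
    def extract(self, raw_text: str):
        ...


class CSVExtractor(Extractor):
    """Example new format: parse the text as comma-separated rows."""

    def extract(self, raw_text: str):
        rows = list(csv.reader(io.StringIO(raw_text.strip())))
        if not rows:
            raise ValueError("no CSV rows found")
        return rows


# Hypothetical registry mapping a format name to a strategy class.
REGISTRY: dict[str, type[Extractor]] = {}


def register(name: str, extractor_cls: type[Extractor]) -> None:
    """Add a strategy class under a case-insensitive format name."""
    if not issubclass(extractor_cls, Extractor):
        raise TypeError("extractor must subclass Extractor")
    REGISTRY[name.lower()] = extractor_cls


def run(raw_text: str, name: str):
    """Look up the strategy for `name` and apply it to the text."""
    return REGISTRY[name.lower()]().extract(raw_text)


register("csv", CSVExtractor)
print(run("id,name\n1,Alice\n", "csv"))  # [['id', 'name'], ['1', 'Alice']]
```

Because callers go through the registry, adding a format touches only the new class and one `register` call.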
Q: How do I enable strict mode?
A: Use the extractor classes directly:
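```python
from llm_content_extractor import JSONExtractor

# Assuming the raw LLM output is in a variable `text`
extractor = JSONExtractor(strict=True)
result = extractor.extract(text)
```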
```python
from llm_content_extractor.strategies import CodeBlockExtractor

extractor = CodeBlockExtractor()
code = "def hello(): return 'world'"
language = extractor.detect_language(code)  # Returns 'python'
```
```python
from llm_content_extractor.strategies import CodeBlockExtractor

# Assuming text with several fenced blocks is in a variable `multi_code_text`
extractor = CodeBlockExtractor()
blocks = extractor.extract_all_blocks(multi_code_text)
for block in blocks:
    print(f"{block['language']}: {block['code']}")
```
```python
from llm_content_extractor.strategies import XMLExtractor, HTMLExtractor

# Assuming the inputs are in variables `xml_string` and `html_string`
xml_extractor = XMLExtractor()
is_valid = xml_extractor.is_valid_xml(xml_string)

html_extractor = HTMLExtractor()
is_valid = html_extractor.is_valid_html(html_string)
```
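How `is_valid_xml` decides validity is not described here. For comparison, a well-formedness check can be written with only the standard library, as sketched below. `is_well_formed_xml` is a hypothetical helper, and it only checks that the string parses; it does not validate against a schema.

```python
import xml.etree.ElementTree as ET


def is_well_formed_xml(xml_string: str) -> bool:
    """Return True if the string parses as a single XML document."""
    try:
        ET.fromstring(xml_string)
    except ET.ParseError:
        return False
    return True


print(is_well_formed_xml("<root><item>test</item></root>"))  # True
print(is_well_formed_xml("<root><item></root>"))  # False
```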
Thanks to all contributors and developers using this project!
If this project helps you, please consider giving it a ⭐️!
We found that llm-content-extractor demonstrated a healthy version release cadence and project activity because the last version was released less than a year ago. It has 1 open source maintainer collaborating on the project.