
LangExtract is a Python library that uses LLMs to extract structured information from unstructured text documents based on user-defined instructions. It processes materials such as clinical notes or reports, identifying and organizing key details while ensuring the extracted data corresponds to the source text.
Note: Using cloud-hosted models like Gemini requires an API key. See the API Key Setup section for instructions on how to get and configure your key.
Extract structured information with just a few lines of code.
First, create a prompt that clearly describes what you want to extract. Then, provide a high-quality example to guide the model.
import langextract as lx
import textwrap

# 1. Define the prompt and extraction rules
prompt = textwrap.dedent("""\
    Extract characters, emotions, and relationships in order of appearance.
    Use exact text for extractions. Do not paraphrase or overlap entities.
    Provide meaningful attributes for each entity to add context.""")

# 2. Provide a high-quality example to guide the model
examples = [
    lx.data.ExampleData(
        text="ROMEO. But soft! What light through yonder window breaks? It is the east, and Juliet is the sun.",
        extractions=[
            lx.data.Extraction(
                extraction_class="character",
                extraction_text="ROMEO",
                attributes={"emotional_state": "wonder"}
            ),
            lx.data.Extraction(
                extraction_class="emotion",
                extraction_text="But soft!",
                attributes={"feeling": "gentle awe"}
            ),
            lx.data.Extraction(
                extraction_class="relationship",
                extraction_text="Juliet is the sun",
                attributes={"type": "metaphor"}
            ),
        ]
    )
]
Provide your input text and the prompt materials to the lx.extract function.
# The input text to be processed
input_text = "Lady Juliet gazed longingly at the stars, her heart aching for Romeo"

# Run the extraction
result = lx.extract(
    text_or_documents=input_text,
    prompt_description=prompt,
    examples=examples,
    model_id="gemini-2.5-flash",
)
Model Selection: gemini-2.5-flash is the recommended default, offering an excellent balance of speed, cost, and quality. For highly complex tasks requiring deeper reasoning, gemini-2.5-pro may provide superior results. For large-scale or production use, a Tier 2 Gemini quota is suggested to increase throughput and avoid rate limits. See the rate-limit documentation for details.

Model Lifecycle: Note that Gemini models have a lifecycle with defined retirement dates. Users should consult the official model version documentation to stay informed about the latest stable and legacy versions.
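For example, opting into the more capable model only requires changing the model_id argument:

# Use the Pro model for tasks that need deeper reasoning
result = lx.extract(
    text_or_documents=input_text,
    prompt_description=prompt,
    examples=examples,
    model_id="gemini-2.5-pro",
)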
The extractions can be saved to a .jsonl file, a popular format for working with language model data. LangExtract can then generate an interactive HTML visualization from this file to review the entities in context.
# Save the results to a JSONL file
lx.io.save_annotated_documents([result], output_name="extraction_results.jsonl", output_dir=".")

# Generate the visualization from the file
html_content = lx.visualize("extraction_results.jsonl")
with open("visualization.html", "w") as f:
    f.write(html_content)
This creates an animated and interactive HTML file.
Note on LLM Knowledge Utilization: This example demonstrates extractions that stay close to the text evidence, extracting "longing" for Lady Juliet's emotional state and identifying "yearning" from "gazed longingly at the stars." The task could be modified to generate attributes that draw more heavily from the LLM's world knowledge (e.g., adding "identity": "Capulet family daughter" or "literary_context": "tragic heroine"). The balance between text-evidence and knowledge-inference is controlled by your prompt instructions and example attributes.
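For instance, a knowledge-leaning variant of the example set might attach attributes that are inferred rather than quoted. A minimal sketch, reusing the classes from the quick start (the attribute values below are illustrative, not part of the library):

import langextract as lx

# A knowledge-leaning example: some attributes are inferred from world
# knowledge rather than quoted from the text (illustrative values).
knowledge_examples = [
    lx.data.ExampleData(
        text="Lady Juliet gazed longingly at the stars, her heart aching for Romeo",
        extractions=[
            lx.data.Extraction(
                extraction_class="character",
                extraction_text="Lady Juliet",
                attributes={
                    "emotional_state": "longing",           # grounded in the text
                    "identity": "Capulet family daughter",  # world knowledge
                    "literary_context": "tragic heroine",   # world knowledge
                },
            ),
        ],
    )
]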
For larger texts, you can process entire documents directly from URLs with parallel processing and enhanced sensitivity:
# Process Romeo & Juliet directly from Project Gutenberg
result = lx.extract(
    text_or_documents="https://www.gutenberg.org/files/1513/1513-0.txt",
    prompt_description=prompt,
    examples=examples,
    model_id="gemini-2.5-flash",
    extraction_passes=3,    # Improves recall through multiple passes
    max_workers=20,         # Parallel processing for speed
    max_char_buffer=1000    # Smaller contexts for better accuracy
)
This approach can extract hundreds of entities from full novels while maintaining high accuracy. The interactive visualization handles large result sets seamlessly, making it easy to explore the output JSONL file. See the full Romeo and Juliet extraction example → for detailed results and performance insights.
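Beyond the HTML view, results can also be inspected programmatically. A minimal sketch, assuming the returned document exposes the same extraction_class, extraction_text, and attributes fields used in the examples above:

# Print each extracted entity with its class and attributes (sketch;
# assumes the result object exposes an `extractions` list)
for extraction in result.extractions:
    print(f"{extraction.extraction_class}: {extraction.extraction_text!r}")
    for key, value in (extraction.attributes or {}).items():
        print(f"  {key}: {value}")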
Install from PyPI (recommended for most users):

pip install langextract

For isolated environments, consider using a virtual environment:
python -m venv langextract_env
source langextract_env/bin/activate # On Windows: langextract_env\Scripts\activate
pip install langextract
LangExtract uses modern Python packaging with pyproject.toml for dependency management. Installing with -e puts the package in development mode, allowing you to modify the code without reinstalling:
git clone https://github.com/google/langextract.git
cd langextract
# For basic installation:
pip install -e .
# For development (includes linting tools):
pip install -e ".[dev]"
# For testing (includes pytest):
pip install -e ".[test]"
To run LangExtract in Docker, build the image from the repository and pass your API key into the container:

docker build -t langextract .
docker run --rm -e LANGEXTRACT_API_KEY="your-api-key" langextract python your_script.py
When using LangExtract with cloud-hosted models (like Gemini or OpenAI), you'll need to set up an API key. On-device models don't require an API key. For developers using local LLMs, LangExtract offers built-in support for Ollama and can be extended to other third-party APIs by updating the inference endpoints.
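For example, a local Ollama model can be plugged in along these lines (a sketch: the OllamaLanguageModel class, the model_url parameter, and the gemma2:2b model name are assumptions here, so check the repository's Ollama documentation for the exact interface):

import langextract as lx
from langextract.inference import OllamaLanguageModel  # assumed class name

# Sketch: run extraction against a local Ollama server (no API key needed)
result = lx.extract(
    text_or_documents=input_text,
    prompt_description=prompt,
    examples=examples,
    language_model_type=OllamaLanguageModel,
    model_id="gemma2:2b",                 # placeholder local model
    model_url="http://localhost:11434",   # default Ollama endpoint (assumed parameter)
    fence_output=False,
    use_schema_constraints=False,
)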
Get API keys from AI Studio for Gemini models, or from the OpenAI platform for OpenAI models.
Option 1: Environment Variable
export LANGEXTRACT_API_KEY="your-api-key-here"
Option 2: .env File (Recommended)
Add your API key to a .env file:
# Add API key to .env file
cat >> .env << 'EOF'
LANGEXTRACT_API_KEY=your-api-key-here
EOF
# Keep your API key secure
echo '.env' >> .gitignore
In your Python code:
import langextract as lx
result = lx.extract(
    text_or_documents=input_text,
    prompt_description="Extract information...",
    examples=[...],
    model_id="gemini-2.5-flash"
)
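If your runtime does not load .env files automatically, you can load the key explicitly before calling lx.extract. A small sketch assuming the python-dotenv package is installed:

from dotenv import load_dotenv

# Populate LANGEXTRACT_API_KEY (and any other variables) from the local
# .env file so the library can read it from the environment.
load_dotenv()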
Option 3: Direct API Key (Not Recommended for Production)
You can also provide the API key directly in your code, though this is not recommended for production use:
result = lx.extract(
    text_or_documents=input_text,
    prompt_description="Extract information...",
    examples=[...],
    model_id="gemini-2.5-flash",
    api_key="your-api-key-here"  # Only use this for testing/development
)
LangExtract also supports OpenAI models. Example OpenAI configuration:
import os

import langextract as lx
from langextract.inference import OpenAILanguageModel

result = lx.extract(
    text_or_documents=input_text,
    prompt_description=prompt,
    examples=examples,
    language_model_type=OpenAILanguageModel,
    model_id="gpt-4o",
    api_key=os.environ.get('OPENAI_API_KEY'),
    fence_output=True,
    use_schema_constraints=False
)
Note: OpenAI models require fence_output=True and use_schema_constraints=False because LangExtract doesn't implement schema constraints for OpenAI yet.
Additional examples of LangExtract in action:
LangExtract can process complete documents directly from URLs. This example demonstrates extraction from the full text of Romeo and Juliet from Project Gutenberg (147,843 characters), showing parallel processing, sequential extraction passes, and performance optimization for long document processing.
View Romeo and Juliet Full Text Example →
Disclaimer: This demonstration is for illustrative purposes of LangExtract's baseline capability only. It does not represent a finished or approved product, is not intended to diagnose or suggest treatment of any disease or condition, and should not be used for medical advice.
LangExtract excels at extracting structured medical information from clinical text. These examples demonstrate both basic entity recognition (medication names, dosages, routes) and relationship extraction (connecting medications to their attributes), showing LangExtract's effectiveness for healthcare applications.
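As a flavor of what such a task looks like, here is a minimal medication-extraction sketch in the same pattern as the quick start (the prompt, example, and input below are illustrative, not taken from the linked examples):

import langextract as lx

# Illustrative clinical extraction: medication names, dosages, and routes
med_prompt = "Extract medication names, dosages, and routes using exact text spans."
med_examples = [
    lx.data.ExampleData(
        text="Patient was given 250 mg IV Cefazolin TID for one week.",
        extractions=[
            lx.data.Extraction(
                extraction_class="medication",
                extraction_text="Cefazolin",
                attributes={"frequency": "TID"},
            ),
            lx.data.Extraction(
                extraction_class="dosage",
                extraction_text="250 mg",
                attributes={},
            ),
            lx.data.Extraction(
                extraction_class="route",
                extraction_text="IV",
                attributes={},
            ),
        ],
    )
]

result = lx.extract(
    text_or_documents="Patient started on 500 mg oral Amoxicillin BID.",
    prompt_description=med_prompt,
    examples=med_examples,
    model_id="gemini-2.5-flash",
)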
Explore RadExtract, a live interactive demo on HuggingFace Spaces that shows how LangExtract can automatically structure radiology reports. Try it directly in your browser with no setup required.
Contributions are welcome! See CONTRIBUTING.md to get started with development, testing, and pull requests. You must sign a Contributor License Agreement before submitting patches.
To run tests locally from the source:
# Clone the repository
git clone https://github.com/google/langextract.git
cd langextract
# Install with test dependencies
pip install -e ".[test]"
# Run all tests
pytest tests
Or reproduce the full CI matrix locally with tox:
tox # runs pylint + pytest on Python 3.10 and 3.11
This project uses automated formatting tools to maintain consistent code style:
# Auto-format all code
./autoformat.sh
# Or run formatters separately
isort langextract tests --profile google --line-length 80
pyink langextract tests --config pyproject.toml
For automatic formatting checks:
pre-commit install # One-time setup
pre-commit run --all-files # Manual run
Run linting before submitting PRs:
pylint --rcfile=.pylintrc langextract tests
See CONTRIBUTING.md for full development guidelines.
This is not an officially supported Google product. If you use LangExtract in production or publications, please cite accordingly and acknowledge usage. Use is subject to the Apache 2.0 License. For health-related applications, use of LangExtract is also subject to the Health AI Developer Foundations Terms of Use.
Happy Extracting!