A TypeScript translation of the original Python LangExtract library by Google LLC. This library provides structured information extraction from text using Large Language Models (LLMs) with full TypeScript support, comprehensive visualization tools, and a powerful CLI interface.
Original Repository: google/langextract
Features
- Structured Information Extraction: Extract entities, relationships, and structured data from text
- Multiple LLM Support: Works with Google Gemini, OpenAI, Ollama, and other LLM providers
- Schema Generation: Automatically generates JSON schemas from examples for better extraction
- Text Alignment: Aligns extracted information with original text positions
- Interactive Visualization: Built-in HTML visualization with animations and controls
- Command Line Interface: CLI tool for easy visualization generation
- Batch Processing: Process multiple documents efficiently
- TypeScript Support: Full TypeScript types and interfaces
- Flexible Output Formats: Support for JSON and YAML output formats
- Error Handling: Robust error handling and validation
Installation

Install from npm:

```bash
npm install langextract
```

Or build from source:

```bash
git clone https://github.com/kmbro/langextract.git
cd langextract/typescript
npm install
npm run build
```
Quick Start

```typescript
import { extract, ExampleData } from "langextract";

const examples: ExampleData[] = [
  {
    text: "John Smith is 30 years old and works at Google.",
    extractions: [
      {
        extractionClass: "person",
        extractionText: "John Smith",
        attributes: {
          age: "30",
          employer: "Google",
        },
      },
    ],
  },
];

async function extractPersonInfo() {
  const result = await extract("Alice Johnson is 25 and works at Microsoft.", {
    promptDescription: "Extract person information including name, age, and employer",
    examples: examples,
    modelType: "gemini",
    apiKey: "your-gemini-api-key",
    modelId: "gemini-2.5-flash",
  });
  console.log(result.extractions);
}

async function extractPersonInfoWithOpenAI() {
  const result = await extract("Alice Johnson is 25 and works at Microsoft.", {
    promptDescription: "Extract person information including name, age, and employer",
    examples: examples,
    modelType: "openai",
    apiKey: "your-openai-api-key",
    modelId: "gpt-4o-mini",
    temperature: 0.1,
  });
  console.log(result.extractions);
}
```
### Quick Visualization

```typescript
import { saveVisualizationPage } from "langextract";

// `result` is an AnnotatedDocument returned by a previous extract() call
saveVisualizationPage(result, "./extraction-viz.html", {
  animationSpeed: 1.0,
  showLegend: true,
  gifOptimized: true,
});
```
API Reference

Main Functions

extract(textOrDocuments, options)

The main function for extracting structured information from text.

Parameters:

- textOrDocuments: string | Document | Document[] - Text or document(s) to process
- options: Extraction options object

Returns: Promise<AnnotatedDocument | AnnotatedDocument[]>

Options:

- promptDescription: string - Instructions for what to extract
- examples: ExampleData[] - Training examples to guide extraction
- modelId: string - LLM model ID (default: "gemini-2.5-flash")
- modelType: "gemini" | "openai" | "ollama" - LLM provider type (default: "gemini")
- apiKey: string - API key for the LLM service
- formatType: FormatType - Output format (JSON or YAML)
- maxCharBuffer: number - Maximum characters per chunk (default: 1000)
- temperature: number - Sampling temperature (default: 0.5)
- fenceOutput: boolean - Whether to expect fenced output (default: false)
- useSchemaConstraints: boolean - Use schema constraints (default: true)
- batchLength: number - Documents per batch (default: 10)
- maxWorkers: number - Maximum parallel workers (default: 10)
- additionalContext: string - Additional context for extraction
- debug: boolean - Enable debug mode (default: true)
- modelUrl: string - Custom model URL (for Ollama and Gemini)
- baseURL: string - Custom base URL (for OpenAI)
- extractionPasses: number - Number of extraction passes (default: 1)
- maxTokens: number - Maximum tokens in the response (default: 2048)
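To make the maxCharBuffer option concrete: long inputs are split into chunks no larger than maxCharBuffer before being sent to the model. A minimal sketch of that kind of chunking (the helper name and the whitespace heuristic are illustrative, not the library's actual implementation):

```typescript
// Split text into chunks of at most maxCharBuffer characters (maxCharBuffer >= 1),
// preferring to break at whitespace so words stay intact.
function chunkByCharBuffer(text: string, maxCharBuffer: number): string[] {
  const chunks: string[] = [];
  let start = 0;
  while (start < text.length) {
    let end = Math.min(start + maxCharBuffer, text.length);
    if (end < text.length) {
      // Back up to the last space inside the window, if there is one.
      const lastSpace = text.lastIndexOf(" ", end);
      if (lastSpace > start) end = lastSpace;
    }
    chunks.push(text.slice(start, end).trim());
    start = end;
  }
  return chunks;
}
```

Smaller buffers mean more (and cheaper) model calls per document, at the cost of less context per call.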
Core Types

ExampleData

```typescript
interface ExampleData {
  text: string;
  extractions: Extraction[];
}
```
CharInterval

```typescript
interface CharInterval {
  startPos?: number;
  endPos?: number;
}
```
Extraction

```typescript
interface Extraction {
  extractionClass: string;
  extractionText: string;
  charInterval?: CharInterval;
  alignmentStatus?: AlignmentStatus;
  extractionIndex?: number;
  groupIndex?: number;
  description?: string;
  attributes?: Record<string, string | string[]>;
  tokenInterval?: TokenInterval;
}
```
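The charInterval fields let you map an extraction back onto the source text. A minimal sketch of that lookup, assuming endPos is exclusive (matching String.prototype.slice semantics; the library may define the bound differently):

```typescript
interface CharInterval {
  startPos?: number;
  endPos?: number;
}

// Recover the exact source span a charInterval points to,
// or undefined when positions are missing.
function spanText(sourceText: string, interval?: CharInterval): string | undefined {
  if (interval?.startPos === undefined || interval.endPos === undefined) return undefined;
  return sourceText.slice(interval.startPos, interval.endPos);
}
```

Comparing spanText(...) against extractionText is a quick sanity check that alignment succeeded.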
Document

```typescript
interface Document {
  text: string;
  documentId?: string;
  additionalContext?: string;
  tokenizedText?: TokenizedText;
}
```
AnnotatedDocument

```typescript
interface AnnotatedDocument {
  documentId?: string;
  extractions?: Extraction[];
  text?: string;
  tokenizedText?: TokenizedText;
}
```
Visualization Functions

visualize(dataSource, options)

Generate an interactive HTML visualization from extractions.

Parameters:

- dataSource: AnnotatedDocument | string - Document or file path
- options: VisualizationOptions - Visualization configuration

Returns: string - HTML content
saveVisualizationPage(dataSource, outputPath, options)

Save a complete HTML page with the visualization.

Parameters:

- dataSource: AnnotatedDocument | string - Document or file path
- outputPath: string - Output file path
- options: VisualizationOptions - Visualization configuration
Advanced Usage
Model Configuration

Google Gemini

```typescript
import { GeminiLanguageModel } from "langextract";

const model = new GeminiLanguageModel({
  modelId: "gemini-2.5-flash",
  apiKey: "your-api-key",
  temperature: 0.3,
});
```

OpenAI

```typescript
import { OpenAILanguageModel } from "langextract";

const model = new OpenAILanguageModel({
  model: "gpt-4o-mini",
  apiKey: "your-openai-api-key",
  temperature: 0.3,
  baseURL: "https://api.openai.com/v1",
});
```

Ollama (Local Models)

```typescript
import { OllamaLanguageModel } from "langextract";

const model = new OllamaLanguageModel({
  model: "llama2:latest",
  modelUrl: "http://localhost:11434",
  temperature: 0.7,
});
```
Response Control

Limiting Response Length with maxTokens

You can control the maximum number of tokens in the model's response using the maxTokens option:

```typescript
// Gemini (the default provider)
const geminiResult = await extract("Extract person information from this text.", {
  examples: examples,
  apiKey: "your-api-key",
  maxTokens: 100,
});

// OpenAI
const openaiResult = await extract("Extract person information from this text.", {
  examples: examples,
  modelType: "openai",
  apiKey: "your-openai-api-key",
  maxTokens: 200,
});

// Ollama
const ollamaResult = await extract("Extract person information from this text.", {
  examples: examples,
  modelType: "ollama",
  modelUrl: "http://localhost:11434",
  maxTokens: 150,
});
```
Custom Model URLs

You can override the default API endpoints for custom deployments:

```typescript
// Gemini with a custom endpoint
const geminiResult = await extract("Extract person information from this text.", {
  examples: examples,
  apiKey: "your-api-key",
  modelType: "gemini",
  modelUrl: "https://your-custom-gemini-endpoint.com",
  maxTokens: 500,
});

// OpenAI with a custom base URL
const openaiResult = await extract("Extract person information from this text.", {
  examples: examples,
  modelType: "openai",
  apiKey: "your-openai-api-key",
  baseURL: "https://your-custom-openai-endpoint.com",
  maxTokens: 300,
});

// Ollama on a remote server
const ollamaResult = await extract("Extract person information from this text.", {
  examples: examples,
  modelType: "ollama",
  modelUrl: "http://your-custom-ollama-server:11434",
  maxTokens: 200,
});
```
Prompt Engineering

Custom Prompt Templates

```typescript
import { PromptTemplateStructured, QAPromptGeneratorImpl } from "langextract";

const template: PromptTemplateStructured = {
  description: "Extract medical entities from clinical text",
  examples: [
    {
      text: "Patient has diabetes and hypertension",
      extractions: [
        {
          extractionClass: "condition",
          extractionText: "diabetes",
        },
        {
          extractionClass: "condition",
          extractionText: "hypertension",
        },
      ],
    },
  ],
};

const generator = new QAPromptGeneratorImpl(template);
const prompt = generator.render("Patient shows signs of asthma");
```
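Conceptually, a Q&A-style generator interleaves the description with question/answer pairs built from the examples, ending with the new input as an unanswered question. The sketch below illustrates that pattern; renderQAPrompt and its exact layout are hypothetical, not the library's real template:

```typescript
interface PromptExample {
  text: string;
  answer: string;
}

// Illustrative Q&A prompt assembly: description, worked examples,
// then the new question with an empty answer cue for the model to fill.
function renderQAPrompt(description: string, examples: PromptExample[], question: string): string {
  const parts: string[] = [description];
  for (const ex of examples) {
    parts.push(`Q: ${ex.text}`, `A: ${ex.answer}`);
  }
  parts.push(`Q: ${question}`, "A:");
  return parts.join("\n");
}
```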
Output Processing

Custom Resolvers

```typescript
import { Resolver, FormatType } from "langextract";

const resolver = new Resolver({
  fenceOutput: true,
  formatType: FormatType.YAML,
  extractionAttributesSuffix: "_attrs",
});
```
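With fenceOutput enabled, the resolver expects the model's answer wrapped in a markdown code fence and must unwrap it before parsing. A minimal sketch of that unwrapping (stripFence is a hypothetical helper, not the library's API):

```typescript
// Remove a surrounding ```json / ```yaml / ``` fence from raw model output,
// returning the trimmed input unchanged when no fence is present.
function stripFence(raw: string): string {
  const match = raw.trim().match(/^```(?:json|yaml)?\s*([\s\S]*?)\s*```$/);
  return match ? match[1] : raw.trim();
}
```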
Schema Enforcement

OpenAI models support JSON schema enforcement through function calling. When you provide a schema, the model is forced to return responses that conform to the specified structure:

```typescript
import { OpenAILanguageModel, GeminiSchemaImpl, FormatType } from "langextract";

const bookSchema = new GeminiSchemaImpl({
  type: "object",
  properties: {
    title: { type: "string" },
    author: { type: "string" },
    publication_year: { type: "number" },
    genre: { type: "string" },
  },
  required: ["title", "author"],
});

const model = new OpenAILanguageModel({
  model: "gpt-4o-mini",
  apiKey: "your-openai-api-key",
  openAISchema: bookSchema,
  formatType: FormatType.JSON,
  temperature: 0.0,
});
```
Performance Optimization

Batch Processing

```typescript
import { extract, Document } from "langextract";

const documents: Document[] = [
  { text: "First document text", documentId: "doc1" },
  { text: "Second document text", documentId: "doc2" },
];

const results = await extract(documents, {
  examples: examples,
  apiKey: "your-api-key",
  batchLength: 5,
});
```
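The batchLength option controls how many documents are grouped per dispatch. The grouping itself is just array chunking, sketched below (toBatches is an illustrative helper, not part of the library's API):

```typescript
// Group items into consecutive batches of at most `batchLength` elements;
// the final batch may be smaller.
function toBatches<T>(items: T[], batchLength: number): T[][] {
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += batchLength) {
    batches.push(items.slice(i, i + batchLength));
  }
  return batches;
}
```

Together with maxWorkers, this bounds how many documents are in flight at once.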
Examples

Use Cases

Medical Information Extraction

```typescript
const medicalExamples: ExampleData[] = [
  {
    text: "The patient has diabetes mellitus type 2 and hypertension.",
    extractions: [
      {
        extractionClass: "condition",
        extractionText: "diabetes mellitus type 2",
        attributes: {
          severity: "moderate",
          type: "type 2",
        },
      },
      {
        extractionClass: "condition",
        extractionText: "hypertension",
        attributes: {
          severity: "mild",
        },
      },
    ],
  },
];

const result = await extract("Patient diagnosed with asthma and obesity.", {
  promptDescription: "Extract medical conditions and their attributes",
  examples: medicalExamples,
  apiKey: "your-api-key",
});
```
Named Entity Recognition

```typescript
const nerExamples: ExampleData[] = [
  {
    text: "Apple Inc. was founded by Steve Jobs in Cupertino, California.",
    extractions: [
      {
        extractionClass: "organization",
        extractionText: "Apple Inc.",
        attributes: {
          type: "company",
        },
      },
      {
        extractionClass: "person",
        extractionText: "Steve Jobs",
        attributes: {
          role: "founder",
        },
      },
      {
        extractionClass: "location",
        extractionText: "Cupertino, California",
        attributes: {
          type: "city",
        },
      },
    ],
  },
];
```
Visualization
LangExtract provides powerful visualization capabilities to help you understand and analyze your extractions. The visualization creates interactive HTML that highlights extracted entities with animations and controls.
Features
- Interactive Controls: Play/pause, next/previous, and progress slider
- Color-coded Highlights: Each extraction class gets a unique color
- Attribute Display: Shows extraction attributes in a side panel
- Smooth Animations: Automatic highlighting with configurable speed
- GIF Optimization: Special styling for video capture and screenshots
- Responsive Design: Works on different screen sizes
- File Support: Load from JSONL files or AnnotatedDocument objects
Basic Visualization

```typescript
import { visualize, saveVisualizationPage } from "langextract";

// `result` is an AnnotatedDocument returned by extract()
const html = visualize(result, {
  animationSpeed: 1.0,
  showLegend: true,
  gifOptimized: true,
});

saveVisualizationPage(result, "./extraction-visualization.html", {
  animationSpeed: 1.5,
  showLegend: true,
  gifOptimized: false,
});
```
Visualization Options

```typescript
interface VisualizationOptions {
  animationSpeed?: number;
  showLegend?: boolean;
  gifOptimized?: boolean;
  contextChars?: number;
}
```
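For the color-coded highlights, each extraction class needs a stable, distinct color. A minimal sketch of that kind of palette assignment (assignColors and the palette are illustrative, not the library's implementation):

```typescript
// Assign each distinct extraction class a color from a fixed palette,
// cycling when there are more classes than palette entries.
function assignColors(classes: string[]): Map<string, string> {
  const palette = ["#ffadad", "#caffbf", "#9bf6ff", "#ffd6a5", "#bdb2ff"];
  const colors = new Map<string, string>();
  for (const cls of classes) {
    if (!colors.has(cls)) colors.set(cls, palette[colors.size % palette.length]);
  }
  return colors;
}
```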
Loading from Files

```typescript
const html = visualize("./extractions.jsonl", {
  animationSpeed: 0.8,
  showLegend: true,
});
```
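A JSONL data source is simply one JSON document per line. A minimal sketch of how such a file's contents could be parsed into documents (parseJsonl is an illustrative helper; the library does this internally when given a file path):

```typescript
interface AnnotatedDocumentLike {
  documentId?: string;
  text?: string;
  extractions?: unknown[];
}

// Parse JSONL contents: one JSON object per non-empty line.
function parseJsonl(contents: string): AnnotatedDocumentLike[] {
  return contents
    .split("\n")
    .filter((line) => line.trim().length > 0)
    .map((line) => JSON.parse(line) as AnnotatedDocumentLike);
}
```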
Command Line Interface

LangExtract provides a CLI tool for easy visualization generation:

```bash
# Basic usage
npx ts-node bin/visualize.ts input.jsonl output.html

# With options
npx ts-node bin/visualize.ts input.jsonl output.html --speed 1.5 --gif-optimized
npx ts-node bin/visualize.ts input.jsonl output.html --no-legend

# Via the npm script
npm run visualize -- input.jsonl output.html --speed 0.8
```
CLI Options:

- --speed <number>: Animation speed in seconds (default: 1.0)
- --no-legend: Hide the color legend
- --gif-optimized: Optimize styling for GIF/video capture
- --context <number>: Context characters around extractions (default: 150)
- --help: Show help message
Examples

```bash
# Slow animation, optimized for GIF capture
npx ts-node bin/visualize.ts extractions.jsonl demo.html --speed 0.5 --gif-optimized

# Fast animation without the legend
npx ts-node bin/visualize.ts extractions.jsonl presentation.html --speed 2.0 --no-legend

# Batch-convert every JSONL file in the current directory
for file in *.jsonl; do
  npx ts-node bin/visualize.ts "$file" "${file%.jsonl}.html"
done
```
Error Handling

LangExtract provides comprehensive error handling for various scenarios:

```typescript
try {
  const result = await extract(text, {
    examples: examples,
    apiKey: "your-api-key",
  });
} catch (error) {
  if (error instanceof Error) {
    console.error("Extraction failed:", error.message);
  }
}
```
Common Error Types
- Missing API Key: Ensure your API key is provided via parameter or environment variable
- Invalid Examples: Examples array must contain valid ExampleData objects
- Model Errors: Check model ID and API key for the specified provider
- File Not Found: Verify file paths for JSONL input files
- Invalid Character Positions: Ensure charInterval positions are within text bounds
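The last check above, interval bounds, is easy to perform yourself before trusting positions. A minimal sketch, assuming endPos is an exclusive bound (isValidInterval is an illustrative helper, not library API):

```typescript
// Validate that a charInterval stays within the text.
// Missing positions are treated as valid because the fields are optional.
function isValidInterval(textLength: number, startPos?: number, endPos?: number): boolean {
  if (startPos === undefined || endPos === undefined) return true;
  return startPos >= 0 && endPos <= textLength && startPos < endPos;
}
```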
Configuration

Environment Variables

Set your API key as an environment variable:

```bash
export LANGEXTRACT_API_KEY="your-api-key"
```
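The usual resolution order is: an explicitly passed apiKey wins, and the environment variable is the fallback. A minimal sketch of that logic (resolveApiKey takes the environment as a parameter for testability; it is an illustrative helper, not the library's API):

```typescript
// Resolve the API key: prefer an explicit option, fall back to the
// LANGEXTRACT_API_KEY entry of an environment map such as process.env.
function resolveApiKey(env: Record<string, string | undefined>, explicit?: string): string {
  const key = explicit ?? env.LANGEXTRACT_API_KEY;
  if (!key) {
    throw new Error("Missing API key: pass apiKey or set LANGEXTRACT_API_KEY");
  }
  return key;
}
```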
TypeScript Configuration

Add the following to your tsconfig.json:

```json
{
  "compilerOptions": {
    "esModuleInterop": true,
    "allowSyntheticDefaultImports": true,
    "target": "ES2020",
    "module": "commonjs",
    "strict": true,
    "declaration": true,
    "outDir": "./dist"
  }
}
```
Development Setup

```bash
git clone https://github.com/kmbro/langextract.git
cd langextract/typescript
npm install
npm run build
npm test

# Run a single test file with a live API key
OPENAI_API_KEY=your-api-key npm test -- medical-extraction.test.ts

# Generate a sample visualization
npm run visualize -- sample-extractions.jsonl output.html
```
Contributing
- Fork the repository
- Create a feature branch
- Make your changes
- Add tests
- Submit a pull request
License
Apache 2.0 License - see LICENSE file for details.
Support
For issues and questions, please open an issue on the GitHub repository.