LangExtract TypeScript

A TypeScript port of the original Python LangExtract library by Google LLC. It provides structured information extraction from text using Large Language Models (LLMs), with complete TypeScript type definitions, built-in visualization tools, and a command-line interface.

Original Repository: google/langextract

Features

  • Structured Information Extraction: Extract entities, relationships, and structured data from text
  • Multiple LLM Support: Works with Google Gemini, OpenAI, Ollama, and other LLM providers
  • Schema Generation: Automatically generates JSON schemas from examples for better extraction
  • Text Alignment: Aligns extracted information with original text positions
  • Interactive Visualization: Built-in HTML visualization with animations and controls
  • Command Line Interface: CLI tool for easy visualization generation
  • Batch Processing: Process multiple documents efficiently
  • TypeScript Support: Full TypeScript types and interfaces
  • Flexible Output Formats: Support for JSON and YAML output formats
  • Error Handling: Robust error handling and validation

Installation

# Install from npm
npm install langextract

# Or install from source
git clone https://github.com/kmbro/langextract.git
cd langextract/typescript
npm install
npm run build

Quick Start

Basic Extraction

import { extract, ExampleData } from "langextract";

// Define examples to guide the extraction
const examples: ExampleData[] = [
  {
    text: "John Smith is 30 years old and works at Google.",
    extractions: [
      {
        extractionClass: "person",
        extractionText: "John Smith",
        attributes: {
          age: "30",
          employer: "Google",
        },
      },
    ],
  },
];

// Extract information from text using Gemini
async function extractPersonInfo() {
  const result = await extract("Alice Johnson is 25 and works at Microsoft.", {
    promptDescription: "Extract person information including name, age, and employer",
    examples: examples,
    modelType: "gemini",
    apiKey: "your-gemini-api-key",
    modelId: "gemini-2.5-flash",
  });

  console.log(result.extractions);
  // Output: [
  //   {
  //     extractionClass: "person",
  //     extractionText: "Alice Johnson",
  //     attributes: {
  //       age: "25",
  //       employer: "Microsoft"
  //     },
  //     charInterval: { startPos: 0, endPos: 13 },
  //     alignmentStatus: "match_exact"
  //   }
  // ]
}

// Extract information from text using OpenAI
async function extractPersonInfoWithOpenAI() {
  const result = await extract("Alice Johnson is 25 and works at Microsoft.", {
    promptDescription: "Extract person information including name, age, and employer",
    examples: examples,
    modelType: "openai",
    apiKey: "your-openai-api-key",
    modelId: "gpt-4o-mini",
    temperature: 0.1,
  });

  console.log(result.extractions);
}

Quick Visualization

import { visualize, saveVisualizationPage } from "langextract";

// Generate and save visualization
saveVisualizationPage(result, "./extraction-viz.html", {
  animationSpeed: 1.0,
  showLegend: true,
  gifOptimized: true,
});

API Reference

Main Functions

extract(textOrDocuments, options)

The main function for extracting structured information from text.

Parameters:

  • textOrDocuments: string | Document | Document[] - Text or document(s) to process
  • options: Extraction options object

Returns: Promise<AnnotatedDocument | AnnotatedDocument[]>

Options:

  • promptDescription: string - Instructions for what to extract
  • examples: ExampleData[] - Training examples to guide extraction
  • modelId: string - LLM model ID (default: "gemini-2.5-flash")
  • modelType: "gemini" | "openai" | "ollama" - LLM provider type (default: "gemini")
  • apiKey: string - API key for the LLM service
  • formatType: FormatType - Output format (JSON or YAML)
  • maxCharBuffer: number - Maximum characters per chunk (default: 1000)
  • temperature: number - Sampling temperature (default: 0.5)
  • fenceOutput: boolean - Whether to expect fenced output (default: false)
  • useSchemaConstraints: boolean - Use schema constraints (default: true)
  • batchLength: number - Documents per batch (default: 10)
  • maxWorkers: number - Maximum parallel workers (default: 10)
  • additionalContext: string - Additional context for extraction
  • debug: boolean - Enable debug mode (default: true)
  • modelUrl: string - Custom model URL (for Ollama and Gemini)
  • baseURL: string - Custom base URL (for OpenAI)
  • extractionPasses: number - Number of extraction passes (default: 1)
  • maxTokens: number - Maximum tokens in the response (default: 2048)
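
As a rough sketch, several of these options can be combined in one call. The input sentence, examples array, and option values below are placeholders and illustrative choices, not recommendations:

import { extract, FormatType } from "langextract";

// Sketch combining several options from the list above (values are illustrative)
const result = await extract("Acme Corp opened a new office in Berlin in 2023.", {
  promptDescription: "Extract organizations, locations, and dates",
  examples: examples, // ExampleData[] defined as in Quick Start
  modelType: "gemini",
  modelId: "gemini-2.5-flash",
  apiKey: "your-api-key",
  formatType: FormatType.YAML, // request YAML output instead of JSON
  maxCharBuffer: 2000, // larger chunks, fewer model calls
  extractionPasses: 2, // more than one extraction pass
  maxWorkers: 4, // cap parallel workers
  temperature: 0.2,
});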

Core Types

ExampleData

interface ExampleData {
  text: string;
  extractions: Extraction[];
}

CharInterval

interface CharInterval {
  startPos?: number;
  endPos?: number;
}

Extraction

interface Extraction {
  extractionClass: string;
  extractionText: string;
  charInterval?: CharInterval;
  alignmentStatus?: AlignmentStatus;
  extractionIndex?: number;
  groupIndex?: number;
  description?: string;
  attributes?: Record<string, string | string[]>;
  tokenInterval?: TokenInterval;
}

Document

interface Document {
  text: string;
  documentId?: string;
  additionalContext?: string;
  tokenizedText?: TokenizedText;
}

AnnotatedDocument

interface AnnotatedDocument {
  documentId?: string;
  extractions?: Extraction[];
  text?: string;
  tokenizedText?: TokenizedText;
}

Visualization Functions

visualize(dataSource, options)

Generate interactive HTML visualization from extractions.

Parameters:

  • dataSource: AnnotatedDocument | string - Document or file path
  • options: VisualizationOptions - Visualization configuration

Returns: string - HTML content

saveVisualizationPage(dataSource, outputPath, options)

Save complete HTML page with visualization.

Parameters:

  • dataSource: AnnotatedDocument | string - Document or file path
  • outputPath: string - Output file path
  • options: VisualizationOptions - Visualization configuration

Advanced Usage

Model Configuration

Google Gemini

import { GeminiLanguageModel } from "langextract";

const model = new GeminiLanguageModel({
  modelId: "gemini-2.5-flash",
  apiKey: "your-api-key",
  temperature: 0.3,
});

OpenAI

import { OpenAILanguageModel } from "langextract";

const model = new OpenAILanguageModel({
  model: "gpt-4o-mini", // or "gpt-4", "gpt-3.5-turbo", etc.
  apiKey: "your-openai-api-key",
  temperature: 0.3,
  baseURL: "https://api.openai.com/v1", // Optional: for custom endpoints
});

Ollama (Local Models)

import { OllamaLanguageModel } from "langextract";

const model = new OllamaLanguageModel({
  model: "llama2:latest",
  modelUrl: "http://localhost:11434",
  temperature: 0.7,
});

Response Control

Limiting Response Length with maxTokens

You can control the maximum number of tokens in the model's response using the maxTokens option:

// Limit Gemini response to 100 tokens
const result = await extract("Extract person information from this text.", {
  examples: examples,
  apiKey: "your-api-key",
  maxTokens: 100, // Short, concise responses
});

// Limit OpenAI response to 200 tokens
const result = await extract("Extract person information from this text.", {
  examples: examples,
  modelType: "openai",
  apiKey: "your-openai-api-key",
  maxTokens: 200, // Moderate response length
});

// Limit Ollama response to 150 tokens
const result = await extract("Extract person information from this text.", {
  examples: examples,
  modelType: "ollama",
  modelUrl: "http://localhost:11434",
  maxTokens: 150, // Local model with token limit
});

Custom Model URLs

You can override the default API endpoints for custom deployments:

// Use custom Gemini endpoint (useful for self-hosted instances)
const result = await extract("Extract person information from this text.", {
  examples: examples,
  apiKey: "your-api-key",
  modelType: "gemini",
  modelUrl: "https://your-custom-gemini-endpoint.com", // Custom URL
  maxTokens: 500,
});

// Use custom OpenAI endpoint
const result = await extract("Extract person information from this text.", {
  examples: examples,
  modelType: "openai",
  apiKey: "your-openai-api-key",
  baseURL: "https://your-custom-openai-endpoint.com", // Custom base URL
  maxTokens: 300,
});

// Use custom Ollama endpoint
const result = await extract("Extract person information from this text.", {
  examples: examples,
  modelType: "ollama",
  modelUrl: "http://your-custom-ollama-server:11434", // Custom Ollama server
  maxTokens: 200,
});

Prompt Engineering

Custom Prompt Templates

import { PromptTemplateStructured, QAPromptGeneratorImpl } from "langextract";

const template: PromptTemplateStructured = {
  description: "Extract medical entities from clinical text",
  examples: [
    {
      text: "Patient has diabetes and hypertension",
      extractions: [
        {
          extractionClass: "condition",
          extractionText: "diabetes",
        },
        {
          extractionClass: "condition",
          extractionText: "hypertension",
        },
      ],
    },
  ],
};

const generator = new QAPromptGeneratorImpl(template);
const prompt = generator.render("Patient shows signs of asthma");

Output Processing

Custom Resolvers

import { Resolver, FormatType } from "langextract";

const resolver = new Resolver({
  fenceOutput: true,
  formatType: FormatType.YAML,
  extractionAttributesSuffix: "_attrs",
});

Schema Enforcement

OpenAI models support JSON schema enforcement through function calling. When you provide a schema, the model will be forced to return responses that conform to the specified structure:

import { OpenAILanguageModel, GeminiSchemaImpl } from "langextract";

// Create a custom schema
const bookSchema = new GeminiSchemaImpl({
  type: "object",
  properties: {
    title: { type: "string" },
    author: { type: "string" },
    publication_year: { type: "number" },
    genre: { type: "string" },
  },
  required: ["title", "author"],
});

const model = new OpenAILanguageModel({
  model: "gpt-4o-mini",
  apiKey: "your-openai-api-key",
  openAISchema: bookSchema, // This enforces the schema
  formatType: FormatType.JSON,
  temperature: 0.0,
});

Performance Optimization

Batch Processing

import { Document } from "langextract";

const documents: Document[] = [
  { text: "First document text", documentId: "doc1" },
  { text: "Second document text", documentId: "doc2" },
];

const results = await extract(documents, {
  examples: examples,
  apiKey: "your-api-key",
  batchLength: 5,
});

Examples

Use Cases

Medical Entity Extraction

const medicalExamples: ExampleData[] = [
  {
    text: "The patient has diabetes mellitus type 2 and hypertension.",
    extractions: [
      {
        extractionClass: "condition",
        extractionText: "diabetes mellitus type 2",
        attributes: {
          severity: "moderate",
          type: "type 2",
        },
      },
      {
        extractionClass: "condition",
        extractionText: "hypertension",
        attributes: {
          severity: "mild",
        },
      },
    ],
  },
];

const result = await extract("Patient diagnosed with asthma and obesity.", {
  promptDescription: "Extract medical conditions and their attributes",
  examples: medicalExamples,
  apiKey: "your-api-key",
});

Named Entity Recognition

const nerExamples: ExampleData[] = [
  {
    text: "Apple Inc. was founded by Steve Jobs in Cupertino, California.",
    extractions: [
      {
        extractionClass: "organization",
        extractionText: "Apple Inc.",
        attributes: {
          type: "company",
        },
      },
      {
        extractionClass: "person",
        extractionText: "Steve Jobs",
        attributes: {
          role: "founder",
        },
      },
      {
        extractionClass: "location",
        extractionText: "Cupertino, California",
        attributes: {
          type: "city",
        },
      },
    ],
  },
];
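
A corresponding call follows the same pattern as the medical example above (a sketch; the input sentence and API key are placeholders):

const result = await extract("Amazon was founded by Jeff Bezos in Bellevue, Washington.", {
  promptDescription: "Extract organizations, people, and locations with their attributes",
  examples: nerExamples,
  apiKey: "your-api-key",
});

console.log(result.extractions);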

Visualization

LangExtract provides powerful visualization capabilities to help you understand and analyze your extractions. The visualization creates interactive HTML that highlights extracted entities with animations and controls.

Features

  • Interactive Controls: Play/pause, next/previous, and progress slider
  • Color-coded Highlights: Each extraction class gets a unique color
  • Attribute Display: Shows extraction attributes in a side panel
  • Smooth Animations: Automatic highlighting with configurable speed
  • GIF Optimization: Special styling for video capture and screenshots
  • Responsive Design: Works on different screen sizes
  • File Support: Load from JSONL files or AnnotatedDocument objects

Basic Visualization

import { visualize, saveVisualizationPage } from "langextract";

// Create a visualization from an AnnotatedDocument
const html = visualize(result, {
  animationSpeed: 1.0, // Seconds between extractions
  showLegend: true, // Show color legend
  gifOptimized: true, // Optimize for video capture
});

// Save as a complete HTML page
saveVisualizationPage(result, "./extraction-visualization.html", {
  animationSpeed: 1.5,
  showLegend: true,
  gifOptimized: false,
});
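
Because visualize returns the HTML as a string, you can also write it out yourself, for example with Node's fs module (a minimal sketch using the html value from the snippet above):

import { writeFileSync } from "fs";

// Persist the HTML returned by visualize() without using saveVisualizationPage
writeFileSync("./extraction-fragment.html", html, "utf-8");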

Visualization Options

interface VisualizationOptions {
  animationSpeed?: number; // Animation speed in seconds (default: 1.0)
  showLegend?: boolean; // Show color legend (default: true)
  gifOptimized?: boolean; // Optimize for GIFs (default: true)
  contextChars?: number; // Context characters around extractions (default: 150)
}

Loading from Files

// Visualize extractions from a JSONL file
const html = visualize("./extractions.jsonl", {
  animationSpeed: 0.8,
  showLegend: true,
});

Command Line Interface

LangExtract provides a CLI tool for easy visualization generation:

# Basic usage
npx ts-node bin/visualize.ts input.jsonl output.html

# With custom options
npx ts-node bin/visualize.ts input.jsonl output.html --speed 1.5 --gif-optimized

# Hide legend
npx ts-node bin/visualize.ts input.jsonl output.html --no-legend

# Using npm script
npm run visualize -- input.jsonl output.html --speed 0.8

CLI Options:

  • --speed <number>: Animation speed in seconds (default: 1.0)
  • --no-legend: Hide the color legend
  • --gif-optimized: Optimize styling for GIF/video capture
  • --context <number>: Context characters around extractions (default: 150)
  • --help: Show help message

Examples

# Create a fast animation for GIF capture
npx ts-node bin/visualize.ts extractions.jsonl demo.html --speed 0.5 --gif-optimized

# Create a presentation-friendly version
npx ts-node bin/visualize.ts extractions.jsonl presentation.html --speed 2.0 --no-legend

# Process multiple files
for file in *.jsonl; do
  npx ts-node bin/visualize.ts "$file" "${file%.jsonl}.html"
done

Error Handling

LangExtract provides comprehensive error handling for various scenarios:

try {
  const result = await extract(text, {
    examples: examples,
    apiKey: "your-api-key",
  });
} catch (error) {
  if (error instanceof Error) {
    console.error("Extraction failed:", error.message);
  }
}

Common Error Types

  • Missing API Key: Ensure your API key is provided via parameter or environment variable
  • Invalid Examples: Examples array must contain valid ExampleData objects
  • Model Errors: Check model ID and API key for the specified provider
  • File Not Found: Verify file paths for JSONL input files
  • Invalid Character Positions: Ensure charInterval positions are within text bounds

Configuration

Environment Variables

Set your API key as an environment variable:

export LANGEXTRACT_API_KEY="your-api-key"
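
If you prefer to read the variable yourself and pass it explicitly, a small sketch (assuming text and examples are defined as in the earlier examples; the error message is illustrative):

const apiKey = process.env.LANGEXTRACT_API_KEY;
if (!apiKey) {
  throw new Error("LANGEXTRACT_API_KEY is not set");
}

const result = await extract(text, {
  examples: examples,
  apiKey: apiKey,
});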

TypeScript Configuration

Add to your tsconfig.json:

{
  "compilerOptions": {
    "esModuleInterop": true,
    "allowSyntheticDefaultImports": true,
    "target": "ES2020",
    "module": "commonjs",
    "strict": true,
    "declaration": true,
    "outDir": "./dist"
  }
}

Development Setup

# Clone and setup
git clone https://github.com/kmbro/langextract.git
cd langextract/typescript

# Install dependencies
npm install

# Build the project
npm run build

# Run tests
npm test

# Run specific integration tests (requires API key)
OPENAI_API_KEY=your-api-key npm test -- medical-extraction.test.ts

# Run visualization CLI
npm run visualize -- sample-extractions.jsonl output.html

Contributing

  • Fork the repository
  • Create a feature branch
  • Make your changes
  • Add tests
  • Submit a pull request

License

Apache 2.0 License - see LICENSE file for details.

Support

For issues and questions, please open an issue on the GitHub repository.
