LongMemEval is a benchmark for evaluating the long-term memory capabilities of chat assistants. It was introduced in the paper:

"LongMemEval: Benchmarking Chat Assistants on Long-Term Interactive Memory"
Di Wu, Hongwei Wang, Wenhao Yu, Yuwei Zhang, Kai-Wei Chang, Dong Yu (ICLR 2025)

What LongMemEval Tests

The benchmark evaluates five core long-term memory abilities through 500 meticulously curated questions:

Information Extraction - Recalling specific information from extensive interactive histories
Multi-Session Reasoning - Synthesizing information across multiple history sessions
Knowledge Updates - Handling information that changes over time
Temporal Reasoning - Understanding time-based relationships in conversations
Abstention - Recognizing when information is not available
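
As a rough illustration, a question record could be mapped to one of the five abilities as sketched below. The `question_type` strings and the `_abs` suffix on `question_id` reflect the upstream dataset format as best understood here; treat both as assumptions and verify them against the downloaded data. `abilityOf` and `QuestionRecord` are hypothetical names, not part of this package.

```typescript
// Hypothetical helper: classify a LongMemEval question by the ability it tests.
// Field names and type strings are assumptions about the dataset format.
type Ability =
  | "information-extraction"
  | "multi-session-reasoning"
  | "knowledge-updates"
  | "temporal-reasoning"
  | "abstention";

interface QuestionRecord {
  question_id: string;
  question_type: string;
}

function abilityOf(q: QuestionRecord): Ability {
  // Assumed convention: abstention questions carry an "_abs" suffix on their id.
  if (/_abs$/.test(q.question_id)) return "abstention";
  switch (q.question_type) {
    case "multi-session":
      return "multi-session-reasoning";
    case "knowledge-update":
      return "knowledge-updates";
    case "temporal-reasoning":
      return "temporal-reasoning";
    default:
      // Assumed to cover the single-session question types.
      return "information-extraction";
  }
}
```

Checking the abstention suffix first matters in this sketch, because an abstention question would otherwise be counted under its underlying question type.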

Why This Matters

Current LLMs show a 30-60% performance drop when tested on LongMemEval, which points to significant difficulty in maintaining coherent long-term memory. The benchmark helps identify these limitations so they can be addressed.

Quick Start

# From packages/longmemeval directory

# 1. Set your API keys
export OPENAI_API_KEY=your_openai_key_here
export HF_TOKEN=your_huggingface_token_here  # For automatic dataset download

# 2. Run a benchmark (downloads datasets automatically if needed)
pnpm bench:s          # Run small dataset (10 parallel requests)
pnpm bench:m          # Run medium dataset (10 parallel requests)
pnpm bench:oracle     # Run oracle dataset (10 parallel requests)

# Or run quick 10-question tests
pnpm bench:s:quick    # Test with 10 questions from small dataset
pnpm bench:m:quick    # Test with 10 questions from medium dataset
pnpm bench:oracle:quick # Test with 10 questions from oracle dataset

Note: The benchmark will automatically download datasets on first run. Get your HuggingFace token from https://huggingface.co/settings/tokens

Manual Setup

1. Install Dependencies

# From the monorepo root
pnpm install
pnpm build

2. Download Dataset

# Set your HuggingFace token
export HF_TOKEN=your_token_here

# Download datasets (no Python or Git LFS required)
pnpm download

If automatic download fails, see DOWNLOAD_GUIDE.md for manual download instructions.

Usage

Run Benchmark

# From packages/longmemeval directory

# Quick commands for each dataset (10 parallel requests)
pnpm bench:s          # Small dataset (full run)
pnpm bench:m          # Medium dataset (full run)
pnpm bench:oracle     # Oracle dataset (full run)

# Quick test runs (10 questions only, 5 parallel)
pnpm bench:s:quick    # Small dataset (quick test)
pnpm bench:m:quick    # Medium dataset (quick test)
pnpm bench:oracle:quick # Oracle dataset (quick test)

# Advanced: Use full CLI with custom options
pnpm cli run --dataset longmemeval_s --model gpt-4o

# Adjust parallelization (default: 5)
pnpm cli run --dataset longmemeval_s --model gpt-4o --concurrency 20

# Graceful shutdown: Press Ctrl+C to stop and save progress

# Run with specific memory configuration
pnpm cli run --dataset longmemeval_s --memory-config last-k --model gpt-4o
pnpm cli run --dataset longmemeval_s --memory-config semantic-recall --model gpt-4o
pnpm cli run --dataset longmemeval_s --memory-config working-memory --model gpt-4o

# Custom subset size
pnpm cli run --dataset longmemeval_oracle --model gpt-4o --subset 25

View Dataset Statistics

pnpm cli stats --dataset longmemeval_s

Evaluate Existing Results

pnpm cli evaluate --results ./results/run_12345/results.jsonl --dataset longmemeval_s

Generate Report

pnpm cli report --results ./results/

Memory Configurations

full-history: Provide complete chat history (baseline)
last-k: Use Mastra's lastMessages configuration (last 20 messages)
semantic-recall: Use Mastra's semantic recall feature (requires vector store)
working-memory: Use Mastra's working memory with template
combined: Combination of last-k and semantic recall
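
To make the difference between full-history and last-k concrete, the sketch below trims a chat history to its final k messages. It illustrates the idea only and is not Mastra's implementation; `ChatMessage` and `lastK` are hypothetical names, and the default of 20 mirrors the value quoted above.

```typescript
// Illustration only: a stand-in for what a "last-k" window does conceptually.
interface ChatMessage {
  role: "user" | "assistant";
  content: string;
}

function lastK(history: ChatMessage[], k = 20): ChatMessage[] {
  // slice(-0) would return the whole array, so handle non-positive k explicitly.
  if (k <= 0) return [];
  // A negative start index counts from the end; shorter histories are returned whole.
  return history.slice(-k);
}
```

Anything older than the window is simply invisible to the model under this strategy, which is why semantic-recall and working-memory exist as alternatives.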

Output

Results are saved in the results/ directory with:

results.jsonl: Individual question results
hypotheses.json: Model responses
questions.json: Questions for reference
metrics.json: Aggregated metrics and configuration
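
For ad hoc analysis, a results.jsonl file can be aggregated with a few lines of TypeScript. The sketch below computes accuracy per question type from JSONL text. The field names `question_type` and `is_correct` are assumptions about the file's schema, so check them against an actual results.jsonl before relying on this; `accuracyByType` is a hypothetical helper.

```typescript
// Assumed shape of one line in results.jsonl; the real field names may differ.
interface ResultLine {
  question_type: string;
  is_correct: boolean;
}

function accuracyByType(jsonl: string): Record<string, number> {
  const totals: Record<string, { correct: number; count: number }> = {};
  for (const line of jsonl.split("\n")) {
    if (line.trim() === "") continue; // skip blank lines, e.g. a trailing newline
    const row = JSON.parse(line) as ResultLine;
    const t = totals[row.question_type] ?? { correct: 0, count: 0 };
    t.count += 1;
    if (row.is_correct) t.correct += 1;
    totals[row.question_type] = t;
  }
  const accuracy: Record<string, number> = {};
  for (const type of Object.keys(totals)) {
    const t = totals[type];
    if (t) accuracy[type] = t.correct / t.count;
  }
  return accuracy;
}
```

In Node.js the input string could come from `fs.readFileSync(path, "utf8")`, with `path` pointing at a results.jsonl file inside a run directory.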

Benchmark Datasets

LongMemEval provides three dataset variants:

longmemeval_s (Small): ~115k tokens per question (30-40 sessions)
- Designed to fit within 128k context windows
- Tests memory across dozens of conversation sessions
longmemeval_m (Medium): ~1.5M tokens per question (500 sessions)
- Challenges even the largest context windows
- Tests memory across hundreds of sessions
longmemeval_oracle: Only evidence sessions included
- Used as a control to verify models can answer when given only relevant context
- Helps isolate memory retrieval issues from comprehension issues
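
The sizing claims above come down to simple arithmetic. The sketch below checks whether a history of a given size fits in a context window once some room is reserved for the model's answer. The token counts are the approximate figures quoted above; the 4,000-token output reserve and the name `fitsInContext` are assumptions made for illustration.

```typescript
// Approximate per-question history sizes quoted above, in tokens.
const HISTORY_TOKENS = {
  longmemeval_s: 115_000,
  longmemeval_m: 1_500_000,
} as const;

// Hypothetical helper: does the history fit once output room is reserved?
function fitsInContext(
  historyTokens: number,
  contextWindow: number,
  reservedForOutput = 4_000, // assumed reserve, not a value from this package
): boolean {
  return historyTokens + reservedForOutput <= contextWindow;
}

const smallFits = fitsInContext(HISTORY_TOKENS.longmemeval_s, 128_000);
const mediumFits = fitsInContext(HISTORY_TOKENS.longmemeval_m, 128_000);
```

Under these assumptions the small variant leaves roughly 9k tokens of headroom in a 128k window, while the medium variant exceeds it by more than a factor of ten, so it can only be attempted with a memory strategy that selects a subset of the history.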

Citation

If you use this benchmark in your research, please cite the original paper:

@article{wu2024longmemeval,
  title={LongMemEval: Benchmarking Chat Assistants on Long-Term Interactive Memory},
  author={Wu, Di and Wang, Hongwei and Yu, Wenhao and Zhang, Yuwei and Chang, Kai-Wei and Yu, Dong},
  journal={arXiv preprint arXiv:2410.10813},
  year={2024}
}

Extending the Benchmark

To add custom memory configurations:

1. Edit src/benchmark/runner.ts and add your configuration to getMemoryConfig()
2. Update the MemoryConfigType in src/data/types.ts
3. Implement the configuration logic in src/memory-adapters/mastra-adapter.ts
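
A rough sketch of what such an extension might look like follows. The real `MemoryConfigType` and `getMemoryConfig()` in this package may be shaped differently; `MemoryOptions` is a simplified stand-in rather than Mastra's actual options type, the returned values are guesses, and `last-50` is an invented example configuration.

```typescript
// Hypothetical shapes for illustration; compare with src/data/types.ts and
// src/benchmark/runner.ts before copying anything.
type MemoryConfigType =
  | "full-history"
  | "last-k"
  | "semantic-recall"
  | "working-memory"
  | "combined"
  | "last-50"; // invented example: the new name is added to the union

// Simplified stand-in for the memory options passed to the adapter.
interface MemoryOptions {
  lastMessages?: number;
  semanticRecall?: boolean;
  workingMemory?: boolean;
}

function getMemoryConfig(type: MemoryConfigType): MemoryOptions {
  switch (type) {
    case "full-history":
      return {};
    case "last-k":
      return { lastMessages: 20 };
    case "semantic-recall":
      return { semanticRecall: true };
    case "working-memory":
      return { workingMemory: true };
    case "combined":
      return { lastMessages: 20, semanticRecall: true };
    case "last-50":
      return { lastMessages: 50 }; // the new configuration's settings
    default: {
      // Compile-time exhaustiveness check: fails to type-check if a case is missing.
      const unreachable: never = type;
      throw new Error("Unknown memory config: " + String(unreachable));
    }
  }
}
```

The `never` check in the default branch is a design choice worth keeping in any real version: adding a name to the union without handling it in the switch then becomes a compile error instead of a silent fallthrough.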

Package last updated on 17 Jun 2026
