EchoSwift: LLM Inference Benchmarking Tool by Infobell IT
EchoSwift is a CLI-based tool designed to benchmark LLM (Large Language Model) inference endpoints. It helps evaluate performance using real-world prompts with configurable parameters and visualized results.
Features
- Easy-to-use CLI interface
- Benchmark LLM inference across multiple Inference Servers
- Measures key performance metrics: latency, throughput, and TTFT (Time to First Token)
- Support for varying input and output token lengths
- Simulate concurrent users to test scalability
- Determine the optimal number of concurrent users the server can handle while maintaining TTFT < 2000 ms and token latency < 200 ms
- Detailed logging and progress tracking
Supported Inference Servers
- TGI
- vLLM
- Ollama
- Llamacpp
- NIMS
- SGLang
Performance metrics:
The benchmark captures the following metrics for each combination of input tokens, output tokens, and parallel users:
- Latency (ms/token)
- TTFT (ms)
- Throughput (tokens/sec)
Installation
You can install EchoSwift using pip:
```bash
pip install echoswift
```
Alternatively, you can install from source:
```bash
git clone https://github.com/Infobellit-Solutions-Pvt-Ltd/EchoSwift.git
cd EchoSwift
pip install -e .
```
Usage
EchoSwift provides a simple CLI interface for running LLM Inference benchmarks.
Below are the steps to run a sample test, assuming the generation endpoint is active.
1. Download the Dataset and create a default config.json
Before running a benchmark, you need to download and filter the dataset:
```bash
echoswift dataprep
```
This command will:
- Download the filtered ShareGPT dataset from Huggingface
- Create a default `config.json` file in your working directory
2. Configure the Benchmark
Edit the generated `config.json` file to match your LLM server configuration. Below is a sample:
```json
{
  "_comment": "EchoSwift Configuration",
  "out_dir": "Results",
  "base_url": "http://localhost:8000/v1/completions",
  "tokenizer_path": "/path/to/tokenizer/",
  "inference_server": "vLLM",
  "model": "/model",
  "random_prompt": true,
  "max_requests": 1,
  "user_counts": [10],
  "increment_user": [100],
  "input_tokens": [32],
  "output_tokens": [256]
}
```
Note: Modify `base_url`, `tokenizer_path`, `model`, and other fields according to your LLM deployment.
🔧 Prompt Configuration Modes
EchoSwift supports two input modes depending on your test requirements:
✅ Fixed Input Tokens
If you want to run the benchmark with a fixed number of input tokens:
- Set `"random_prompt": false`
- Define both `input_tokens` and `output_tokens` explicitly, as in the sketch below
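A minimal config fragment for this mode might look like the following (the token counts are illustrative; keep the remaining fields from your generated `config.json`):
```json
{
  "random_prompt": false,
  "input_tokens": [32],
  "output_tokens": [256]
}
```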
🎲 Random Input Length
If you prefer using randomized prompts from the dataset:
- Set `"random_prompt": true`
- Provide only `output_tokens`; EchoSwift will choose random input lengths from the dataset (see the sketch below)
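A corresponding fragment might look like this (again, the token count is illustrative):
```json
{
  "random_prompt": true,
  "output_tokens": [256]
}
```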
You can use these configurations with both commands:
- `echoswift start` (standard benchmark)
- `echoswift optimaluserrun` (to determine optimal concurrency)
👥 User Load Configuration (for `optimaluserrun`)
To perform optimal user benchmarking:
- Use `user_counts` to set the initial number of concurrent users
- Use `increment_user` to define how many users to add per step
Example:
```json
"user_counts": [10],
"increment_user": [100]
```
In this case, the benchmark starts with 10 concurrent users and adds 100 more in each iteration until the TTFT and token latency thresholds are exceeded.
🔤 Tokenizer Configuration
EchoSwift allows two ways to configure the tokenizer used for benchmarking:
Option 1: Use a Custom Tokenizer
Set the `TOKENIZER` environment variable to the path of your desired tokenizer.
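For example, in a POSIX shell you could export it before starting the benchmark (the path below is a placeholder for your own tokenizer directory):
```bash
export TOKENIZER=/path/to/your/tokenizer
```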
Option 2: Use Default Fallback
If `TOKENIZER` is not set or is empty, EchoSwift falls back to a built-in default tokenizer.
This ensures the tool remains functional, but the fallback tokenizer may not align with your model's behavior. Use it only for testing or when no tokenizer is specified.
✅ Best Practice: Always specify the correct tokenizer that matches your LLM model for accurate benchmarking results.
Choose the combination that matches your requirements to benchmark your LLM endpoint effectively.
3. Run the Benchmark
Option A: Standard Benchmarking
Use the `start` command to run a basic benchmark:
```bash
echoswift start --config path/to/config.json
```
Option B: Optimal User Load Benchmarking
To find the optimal number of concurrent users for your LLM endpoint:
```bash
echoswift optimaluserrun --config path/to/config.json
```
4. Plot the Results
Visualize the benchmark results using the built-in plotting tool:
```bash
echoswift plot --results-dir path/to/your/results_dir
```
Output
EchoSwift will create a results directory (or the directory specified in `out_dir`) containing:
- CSV files with raw benchmark data
- Averaged results for each combination of users, input tokens, and output tokens
- Log files for each Locust run
Analyzing Results
After the benchmark completes, you can find CSV files in the output directory. These files contain information about latency, throughput, and TTFT for each test configuration.
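As a quick way to inspect those files, the following sketch (not part of EchoSwift; it assumes pandas is installed and that `out_dir` is the default `Results`) loads every CSV under the results directory and prints summary statistics for its numeric columns:
```python
from pathlib import Path

import pandas as pd

results_dir = Path("Results")  # change to the directory set in "out_dir"

for csv_file in sorted(results_dir.rglob("*.csv")):
    df = pd.read_csv(csv_file)
    print(f"\n=== {csv_file} ===")
    # Summary statistics for the numeric columns (e.g. latency, TTFT, throughput);
    # the exact column names depend on your EchoSwift run.
    print(df.describe())
```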
Citation
If you find our resource useful, please cite our paper:
```bibtex
@inproceedings{Krishna2024,
  series = {ICPE '24},
  title = {EchoSwift: An Inference Benchmarking and Configuration Discovery Tool for Large Language Models (LLMs)},
  url = {https://dl.acm.org/doi/10.1145/3629527.3652273},
  DOI = {10.1145/3629527.3652273},
  booktitle = {Companion of the 15th ACM/SPEC International Conference on Performance Engineering},
  publisher = {ACM},
  author = {Krishna, Karthik and Bandili, Ramana},
  year = {2024},
  month = May,
  collection = {ICPE '24}
}
```