English | 简体中文
📖 Documents
⭐ If you like this project, please click the "Star" button at the top right to support us. Your support is our motivation to keep going!
📋 Table of Contents
📝 Introduction
EvalScope is the official model evaluation and performance benchmarking framework launched by the ModelScope community. It comes with built-in common benchmarks and evaluation metrics, such as MMLU, CMMLU, C-Eval, GSM8K, ARC, HellaSwag, TruthfulQA, MATH, and HumanEval. EvalScope supports various types of model evaluations, including LLMs, multimodal LLMs, embedding models, and reranker models. It is also applicable to multiple evaluation scenarios, such as end-to-end RAG evaluation, arena mode, and model inference performance stress testing. Moreover, with the seamless integration of the ms-swift training framework, evaluations can be initiated with a single click, providing full end-to-end support from model training to evaluation 🚀
EvalScope Framework (architecture diagram).
The architecture includes the following modules:
- Model Adapter: Converts the outputs of specific models into the format required by the framework, supporting both API-based and locally run models.
- Data Adapter: Converts and preprocesses input data to meet various evaluation needs and formats (a conceptual sketch of the two adapters follows this list).
- Evaluation Backend:
  - Native: EvalScope's own default evaluation framework, supporting various evaluation modes, including single model evaluation, arena mode, baseline model comparison mode, etc.
  - OpenCompass: Supports OpenCompass as the evaluation backend, providing higher-level encapsulation and task simplification, allowing you to submit evaluation tasks more easily.
  - VLMEvalKit: Supports VLMEvalKit as the evaluation backend, enabling easy initiation of multimodal evaluation tasks, supporting various multimodal models and datasets.
  - RAGEval: Supports RAG evaluation, including independent evaluation of embedding models and rerankers using MTEB/CMTEB, as well as end-to-end evaluation using RAGAS.
  - ThirdParty: Other third-party evaluation tasks, such as ToolBench.
- Performance Evaluator: Model performance evaluation, responsible for measuring model inference service performance, including performance testing, stress testing, performance report generation, and visualization.
- Evaluation Report: The final generated evaluation report summarizes the model's performance, which can be used for decision-making and further model optimization.
- Visualization: Visualization results help users intuitively understand evaluation results, facilitating analysis and comparison of different model performances.
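To make the roles of the Model Adapter and Data Adapter concrete, here is a purely conceptual sketch of an adapter-style evaluation loop. The class and method names below are hypothetical illustrations of the pattern, not EvalScope's actual interfaces:

```python
# Hypothetical sketch of the adapter pattern described above;
# these are NOT EvalScope's real classes or method names.
from dataclasses import dataclass
from typing import List


@dataclass
class Sample:
    prompt: str
    reference: str


class ModelAdapter:
    """Wraps an API-based or locally run model behind a uniform predict() call."""

    def predict(self, prompts: List[str]) -> List[str]:
        raise NotImplementedError


class DataAdapter:
    """Converts a raw benchmark into prompts and scores the model's predictions."""

    def load(self) -> List[Sample]:
        raise NotImplementedError

    def score(self, predictions: List[str], samples: List[Sample]) -> float:
        raise NotImplementedError


def evaluate(model: ModelAdapter, data: DataAdapter) -> float:
    """One evaluation pass: load data, predict, and score."""
    samples = data.load()
    predictions = model.predict([s.prompt for s in samples])
    return data.score(predictions, samples)
```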
🎉 News
- 🔥 [2024.11.26] The model inference service performance evaluator has been completely refactored: it now supports local inference service startup and Speed Benchmark; asynchronous call error handling has been optimized. For more details, refer to the 📖 User Guide.
- 🔥 [2024.10.31] The best practice for evaluating Multimodal-RAG has been updated, please check the 📖 Blog for more details.
- 🔥 [2024.10.23] Supports multimodal RAG evaluation, including the assessment of image-text retrieval using CLIP_Benchmark, and extends RAGAS to support end-to-end multimodal metrics evaluation.
- 🔥 [2024.10.8] Support for RAG evaluation, including independent evaluation of embedding models and rerankers using MTEB/CMTEB, as well as end-to-end evaluation using RAGAS.
- 🔥 [2024.09.18] Our documentation has been updated to include a blog module, featuring some technical research and discussions related to evaluations. We invite you to 📖 read it.
- 🔥 [2024.09.12] Support for LongWriter evaluation, which supports 10,000+ word generation. You can use the benchmark LongBench-Write to measure the long output quality as well as the output length.
- 🔥 [2024.08.30] Support for custom dataset evaluations, including text datasets and multimodal image-text datasets.
- 🔥 [2024.08.20] Updated the official documentation, including getting started guides, best practices, and FAQs. Feel free to 📖read it here!
- 🔥 [2024.08.09] Simplified the installation process, allowing for pypi installation of vlmeval dependencies; optimized the multimodal model evaluation experience, achieving up to 10x acceleration based on the OpenAI API evaluation chain.
- 🔥 [2024.07.31] Important change: the package name `llmuses` has been changed to `evalscope`. Please update your code accordingly.
- 🔥 [2024.07.26] Support for VLMEvalKit as a third-party evaluation framework to initiate multimodal model evaluation tasks.
- 🔥 [2024.06.29] Support for OpenCompass as a third-party evaluation framework, which we have encapsulated at a higher level, supporting pip installation and simplifying evaluation task configuration.
- 🔥 [2024.06.13] EvalScope seamlessly integrates with the fine-tuning framework SWIFT, providing full-chain support from LLM training to evaluation.
- 🔥 [2024.06.13] Integrated the Agent evaluation dataset ToolBench.
🛠️ Installation
Method 1: Install Using pip
We recommend using conda to manage your environment and installing dependencies with pip:
- Create a conda environment (optional)

  ```shell
  # It is recommended to use Python 3.10
  conda create -n evalscope python=3.10
  # Activate the conda environment
  conda activate evalscope
  ```
- Install dependencies using pip

  ```shell
  pip install evalscope                # Install Native backend (default)
  # Additional options
  pip install evalscope[opencompass]   # Install OpenCompass backend
  pip install evalscope[vlmeval]       # Install VLMEvalKit backend
  pip install evalscope[rag]           # Install RAGEval backend
  pip install evalscope[perf]          # Install Perf dependencies
  pip install evalscope[all]           # Install all backends (Native, OpenCompass, VLMEvalKit, RAGEval)
  ```
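As a quick post-install sanity check, the following minimal sketch can be used. It assumes the package exposes a `__version__` attribute, which is common but not guaranteed for every release:

```python
# Minimal post-install check (assumes evalscope exposes __version__).
import evalscope

print(evalscope.__version__)  # prints the installed EvalScope version
```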
> [!WARNING]
> As the project has been renamed to `evalscope`, for versions `v0.4.3` or earlier you can install with:
>
> ```shell
> pip install 'llmuses<=0.4.3'
> ```
>
> and import the relevant dependencies using `llmuses`:
>
> ```python
> from llmuses import ...
> ```
Method 2: Install from Source
- Download the source code

  ```shell
  git clone https://github.com/modelscope/evalscope.git
  ```
- Install dependencies

  ```shell
  cd evalscope/
  pip install -e .                  # Install Native backend
  # Additional options
  pip install -e '.[opencompass]'   # Install OpenCompass backend
  pip install -e '.[vlmeval]'       # Install VLMEvalKit backend
  pip install -e '.[rag]'           # Install RAGEval backend
  pip install -e '.[perf]'          # Install Perf dependencies
  pip install -e '.[all]'           # Install all backends (Native, OpenCompass, VLMEvalKit, RAGEval)
  ```
🚀 Quick Start
1. Simple Evaluation
To evaluate a model using default settings on specified datasets, follow the process below:
Installation using pip

You can execute this in any directory:

```shell
python -m evalscope.run \
 --model Qwen/Qwen2.5-0.5B-Instruct \
 --template-type qwen \
 --datasets gsm8k ceval \
 --limit 10
```
Installation from source

You need to execute this in the `evalscope` directory:

```shell
python evalscope/run.py \
 --model Qwen/Qwen2.5-0.5B-Instruct \
 --template-type qwen \
 --datasets gsm8k ceval \
 --limit 10
```

If prompted with `Do you wish to run the custom code? [y/N]`, please type `y`.
Results (tested with only 10 samples)
Report table:
```text
+-----------------------+--------------------+-----------------+
| Model                 | ceval              | gsm8k           |
+=======================+====================+=================+
| Qwen2.5-0.5B-Instruct | (ceval/acc) 0.5577 | (gsm8k/acc) 0.5 |
+-----------------------+--------------------+-----------------+
```
Basic Parameter Descriptions
- `--model`: Specifies the `model_id` of the model on ModelScope, allowing automatic download; for example, see the Qwen2-0.5B-Instruct model link. You can also use a local path, such as `/path/to/model`.
- `--template-type`: Specifies the template type corresponding to the model. Refer to the `Default Template` field in the template table for this value.
- `--datasets`: The dataset name; multiple datasets can be specified, separated by spaces, and they will be downloaded automatically. Refer to the supported datasets list for available options.
- `--limit`: Maximum number of evaluation samples per dataset; if not specified, all samples will be evaluated. Setting a small value is useful for quick validation.
2. Parameterized Evaluation
If you wish to conduct a more customized evaluation, such as modifying model parameters or dataset parameters, you can use the following commands:
Example 1:

```shell
python evalscope/run.py \
 --model qwen/Qwen2-0.5B-Instruct \
 --template-type qwen \
 --model-args revision=master,precision=torch.float16,device_map=auto \
 --datasets gsm8k ceval \
 --use-cache true \
 --limit 10
```
Example 2:

```shell
python evalscope/run.py \
 --model qwen/Qwen2-0.5B-Instruct \
 --template-type qwen \
 --generation-config do_sample=false,temperature=0.0 \
 --datasets ceval \
 --dataset-args '{"ceval": {"few_shot_num": 0, "few_shot_random": false}}' \
 --limit 10
```
Parameter Descriptions
In addition to the three basic parameters, the other parameters are as follows:
- `--model-args`: Model loading parameters, separated by commas, in `key=value` format.
- `--generation-config`: Generation parameters, separated by commas, in `key=value` format.
  - `do_sample`: Whether to use sampling; default is `false`.
  - `max_new_tokens`: Maximum generation length; default is 1024.
  - `temperature`: Sampling temperature.
  - `top_p`: Sampling threshold.
  - `top_k`: Sampling threshold.
- `--use-cache`: Whether to use the local cache; default is `false`. If set to `true`, previously evaluated model and dataset combinations will not be evaluated again and will be read directly from the local cache.
- `--dataset-args`: Evaluation dataset configuration parameters, provided in JSON format, where the key is the dataset name and the value is its parameters; note that these must correspond one-to-one with the values in `--datasets`.
  - `--few_shot_num`: Number of few-shot examples.
  - `--few_shot_random`: Whether to randomly sample few-shot data; if not specified, defaults to `true`.
3. Use the run_task Function to Submit an Evaluation Task
Using the `run_task` function to submit an evaluation task requires the same parameters as the command line. You need to pass a dictionary as the parameter, which includes the following fields:
1. Configuration Task Dictionary Parameters
```python
import torch
from evalscope.constants import DEFAULT_ROOT_CACHE_DIR

your_task_cfg = {
    'model_args': {'revision': None, 'precision': torch.float16, 'device_map': 'auto'},
    'generation_config': {'do_sample': False, 'repetition_penalty': 1.0, 'max_new_tokens': 512},
    'dataset_args': {},
    'dry_run': False,
    'model': 'qwen/Qwen2-0.5B-Instruct',
    'template_type': 'qwen',
    'datasets': ['arc', 'hellaswag'],
    'work_dir': DEFAULT_ROOT_CACHE_DIR,
    'outputs': DEFAULT_ROOT_CACHE_DIR,
    'mem_cache': False,
    'dataset_hub': 'ModelScope',
    'dataset_dir': DEFAULT_ROOT_CACHE_DIR,
    'limit': 10,
    'debug': False
}
```
Here, `DEFAULT_ROOT_CACHE_DIR` is set to `'~/.cache/evalscope'`.
2. Execute Task with run_task
```python
from evalscope.run import run_task

run_task(task_cfg=your_task_cfg)
```
Evaluation Backend
EvalScope supports using third-party evaluation frameworks to initiate evaluation tasks, which we call an Evaluation Backend. The currently supported Evaluation Backends include the following (a minimal usage sketch follows this list):
- Native: EvalScope's own default evaluation framework, supporting various evaluation modes including single model evaluation, arena mode, and baseline model comparison mode.
- OpenCompass: Initiate OpenCompass evaluation tasks through EvalScope. Lightweight, easy to customize, supports seamless integration with the LLM fine-tuning framework ms-swift. 📖 User Guide
- VLMEvalKit: Initiate VLMEvalKit multimodal evaluation tasks through EvalScope. Supports various multimodal models and datasets, and offers seamless integration with the LLM fine-tuning framework ms-swift. 📖 User Guide
- RAGEval: Initiate RAG evaluation tasks through EvalScope, supporting independent evaluation of embedding models and rerankers using MTEB/CMTEB, as well as end-to-end evaluation using RAGAS: 📖 User Guide
- ThirdParty: Third-party evaluation tasks, such as ToolBench and LongBench-Write.
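As an illustration of how a backend task can be submitted through `run_task`, here is a minimal sketch. It assumes the backend is selected via an `eval_backend` field with a backend-specific `eval_config` dictionary; the exact configuration fields depend on the chosen backend, so treat the values below as placeholders and consult the corresponding 📖 User Guide.

```python
from evalscope.run import run_task

# Minimal sketch (not a complete configuration): select a third-party backend
# via `eval_backend` and pass its own settings through `eval_config`.
# The fields inside `eval_config` are backend-specific placeholders here.
task_cfg = {
    'eval_backend': 'OpenCompass',   # or 'VLMEvalKit', 'RAGEval', 'Native'
    'eval_config': {
        'datasets': ['gsm8k'],       # illustrative: datasets understood by the chosen backend
        # ... model / API settings required by the chosen backend go here ...
    },
}

run_task(task_cfg=task_cfg)
```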
Model Serving Performance Evaluation
A stress testing tool focused on large language models, which can be customized to support various dataset formats and different API protocol formats.
Reference: Performance Testing 📖 User Guide
Supports wandb for recording results
Supports Speed Benchmark
It supports speed testing and provides speed benchmarks similar to those found in the official Qwen reports:
Speed Benchmark Results:
```text
+---------------+-----------------+----------------+
| Prompt Tokens | Speed(tokens/s) | GPU Memory(GB) |
+---------------+-----------------+----------------+
|             1 |           50.69 |           0.97 |
|          6144 |           51.36 |           1.23 |
|         14336 |           49.93 |           1.59 |
|         30720 |           49.56 |           2.34 |
+---------------+-----------------+----------------+
```
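A minimal sketch of how such a stress test might be launched programmatically is shown below. It assumes a Python entry point `run_perf_benchmark` that accepts a configuration dictionary targeting an OpenAI-compatible endpoint; the exact module path and parameter names may differ across versions, so refer to the Performance Testing 📖 User Guide for the authoritative usage.

```python
# Hedged sketch: the entry point and parameter names are assumptions and may
# differ across EvalScope versions; see the Performance Testing User Guide.
from evalscope.perf.main import run_perf_benchmark

task_cfg = {
    'url': 'http://127.0.0.1:8000/v1/chat/completions',  # OpenAI-compatible endpoint (placeholder)
    'api': 'openai',        # API protocol format
    'model': 'qwen2.5',     # served model name (placeholder)
    'dataset': 'openqa',    # built-in test dataset (assumed name)
    'parallel': 1,          # number of concurrent requests
    'number': 15,           # total number of requests
}

run_perf_benchmark(task_cfg)
```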
Custom Dataset Evaluation
EvalScope supports custom dataset evaluation. For detailed information, please refer to the Custom Dataset Evaluation 📖User Guide
Offline Evaluation
You can use local datasets to evaluate a model without an internet connection.
Refer to: Offline Evaluation 📖 User Guide
Arena Mode
Arena mode allows multiple candidate models to be evaluated through pairwise battles; you can use either the AI Enhanced Auto-Reviewer (AAR) automatic evaluation process or manual evaluation to obtain the evaluation report.
Refer to: Arena Mode 📖 User Guide
TO-DO List
Star History