EvalPlus(📖) => 📚
About
EvalPlus is a rigorous evaluation framework for LLM4Code, with:
- ✨ HumanEval+: 80x more tests than the original HumanEval!
- ✨ MBPP+: 35x more tests than the original MBPP!
- ✨ EvalPerf: evaluating the efficiency of LLM-generated code!
- ✨ Framework: our packages/images/tools can easily and safely evaluate LLMs on the above benchmarks.
Why EvalPlus?
- ✨ Precise evaluation & ranking: See our leaderboard for the latest LLM rankings before & after rigorous evaluation.
- ✨ Coding rigorousness: Compare the scores before and after applying the EvalPlus tests! A smaller drop is better, as it indicates more rigorous and less lax code generation, while a big drop means the generated code tends to be fragile.
- ✨ Code efficiency: Beyond correctness, our EvalPerf dataset evaluates the efficiency of LLM-generated code via performance-exercising coding tasks and test inputs.
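To make the "score difference" idea concrete, here is a minimal sketch that ranks models by how much their pass@1 drops once the extra tests are applied. The model names and scores are hypothetical placeholders for illustration only, not leaderboard data, and `pass_at_1_drop` is a helper invented for this example rather than part of the EvalPlus API.

```python
def pass_at_1_drop(base_score: float, plus_score: float) -> float:
    """Drop in pass@1 (percentage points) after adding the extra tests."""
    return round(base_score - plus_score, 1)


# Hypothetical scores, for illustration only (not real leaderboard data).
scores = {
    "model-a": {"base": 80.0, "plus": 75.0},
    "model-b": {"base": 82.0, "plus": 68.0},
}

# Smaller drop first: those models produce less fragile code.
ranked = sorted(
    scores,
    key=lambda name: pass_at_1_drop(scores[name]["base"], scores[name]["plus"]),
)

for name in ranked:
    s = scores[name]
    drop = pass_at_1_drop(s["base"], s["plus"])
    print(f"{name}: {s['base']:.1f} -> {s['plus']:.1f} (drop {drop:.1f} pp)")
```

In this made-up case, model-b scores higher on the original tests but loses far more under the stricter tests, so it ranks below model-a on rigorousness.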
Want to know more details? Read our papers & materials!
📰 News
Notable EvalPlus updates are tracked below:
- [2024-10-20 v0.3.1]: EvalPlus v0.3.1 is officially released! Release highlights include (i) code efficiency evaluation via EvalPerf, (ii) one command to run the whole pipeline (generation + post-processing + evaluation), and (iii) support for more inference backends such as Google Gemini & Anthropic.
- [2024-06-09 pre v0.3.0]: Improved ground-truth solutions for MBPP+ tasks (IDs: 459, 102, 559). Thanks to EvalArena.
- [2024-04-17 pre v0.3.0]: MBPP+ is upgraded to v0.2.0 by removing some broken tasks (399 -> 378 tasks). A ~4pp pass@1 improvement can be expected.
- Earlier:
  - (v0.2.1) You can use EvalPlus datasets via bigcode-evaluation-harness! HumanEval+ oracle fixes (32).
  - (v0.2.0) MBPP+ is released! HumanEval contract & input fixes (0/3/9/148/114/1/2/99/28/32/35/160).
  - (v0.1.7) Leaderboard release; HumanEval+ contract and input fixes (32/166/126/6).
  - (v0.1.6) Configurable and by-default-conservative timeout settings; HumanEval+ contract & ground-truth fixes (129/148/75/53/0/3/9/140).
  - (v0.1.5) HumanEval+ mini is released for ultra-fast evaluation when you have too many samples!
  - (v0.1.1) Optimizing user experiences: evaluation speed, PyPI package, Docker, etc.
  - (v0.1.0) HumanEval+ is released!
🔥 Quick Start
- Code correctness evaluation: HumanEval(+) or MBPP(+)
pip install --upgrade "evalplus[vllm] @ git+https://github.com/evalplus/evalplus"
evalplus.evaluate --model "ise-uiuc/Magicoder-S-DS-6.7B" \
--dataset [humaneval|mbpp] \
--backend vllm \
--greedy
Code execution within Docker:
evalplus.codegen --model "ise-uiuc/Magicoder-S-DS-6.7B" \
--dataset humaneval \
--backend vllm \
--greedy
docker run --rm -v $(pwd)/evalplus_results:/app ganler/evalplus:latest \
evalplus.evaluate --dataset humaneval \
--samples /app/humaneval/ise-uiuc--Magicoder-S-DS-6.7B_vllm_temp_0.0.jsonl
- Code efficiency evaluation: EvalPerf (*nix only)
pip install --upgrade "evalplus[perf,vllm] @ git+https://github.com/evalplus/evalplus"
sudo sh -c 'echo 0 > /proc/sys/kernel/perf_event_paranoid'
evalplus.evalperf --model "ise-uiuc/Magicoder-S-DS-6.7B" \
--backend vllm
Code execution within Docker:
evalplus.codegen --model "ise-uiuc/Magicoder-S-DS-6.7B" \
--dataset evalperf \
--backend vllm \
--temperature 1.0 \
--n-samples 100
docker run --cap-add PERFMON --rm -v $(pwd)/evalplus_results:/app ganler/evalplus:latest \
evalplus.evalperf --samples /app/evalperf/ise-uiuc--Magicoder-S-DS-6.7B_vllm_temp_1.0.jsonl
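The sample file names in the Docker commands above appear to follow a regular pattern: the `/` in the model name becomes `--`, followed by the backend and the sampling temperature. The sketch below reproduces that pattern so you can predict which file to pass to `--samples`. Note that `samples_path` is a hypothetical helper written for this example, and the naming rule is inferred only from the example paths in this README, so verify it against the files actually written under `evalplus_results/`.

```python
def samples_path(
    model: str,
    dataset: str,
    backend: str,
    temperature: float,
    root: str = "evalplus_results",
) -> str:
    """Guess the samples file path (pattern inferred from the README examples)."""
    # "ise-uiuc/Magicoder-S-DS-6.7B" -> "ise-uiuc--Magicoder-S-DS-6.7B"
    file_stem = model.replace("/", "--")
    return f"{root}/{dataset}/{file_stem}_{backend}_temp_{temperature}.jsonl"


print(samples_path("ise-uiuc/Magicoder-S-DS-6.7B", "humaneval", "vllm", 0.0))
# prints: evalplus_results/humaneval/ise-uiuc--Magicoder-S-DS-6.7B_vllm_temp_0.0.jsonl
```

Inside the container the same file is visible under `/app/...`, because the commands above mount `evalplus_results` at `/app`.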
🚀 LLM Backends
HuggingFace models
evalplus.evaluate --model "ise-uiuc/Magicoder-S-DS-6.7B" \
--dataset [humaneval|mbpp] \
--backend hf \
--greedy
[!Note]
EvalPlus uses different prompts for base and chat models. By default, the model type is detected via tokenizer.chat_template when using hf/vllm as the backend. For other backends, only chat mode is allowed. Therefore, if your base model comes with a tokenizer.chat_template, please add --force-base-prompt to avoid being evaluated in chat mode.
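The rule in the note can be sketched as a small decision function. This is only an illustration of the behavior described above, not EvalPlus's actual implementation: `select_prompt_mode` and `LOCAL_BACKENDS` are hypothetical names, and treating the force flag as having no effect on non-hf/vllm backends is an assumption based on "only chat mode is allowed".

```python
from typing import Optional

# Backends for which the note says the chat template is inspected.
LOCAL_BACKENDS = {"hf", "vllm"}


def select_prompt_mode(
    backend: str,
    chat_template: Optional[str],
    force_base_prompt: bool = False,
) -> str:
    """Return "chat" or "base", following the rule described in the note."""
    if backend not in LOCAL_BACKENDS:
        # Assumption: other backends always use chat mode.
        return "chat"
    if force_base_prompt:
        # Corresponds to passing --force-base-prompt on the command line.
        return "base"
    # Default: a tokenizer that ships a chat template is treated as a chat model.
    return "chat" if chat_template is not None else "base"


# A base model whose tokenizer happens to ship a chat template:
print(select_prompt_mode("vllm", "{{ messages }}"))                          # chat
print(select_prompt_mode("vllm", "{{ messages }}", force_base_prompt=True))  # base
```

The second call shows why the flag matters: without it, a base model that ships a chat template would be prompted as a chat model.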
Enable Flash Attention 2:
pip install packaging ninja
pip install flash-attn --no-build-isolation
evalplus.evaluate --model "ise-uiuc/Magicoder-S-DS-6.7B" \
--dataset [humaneval|mbpp] \
--backend hf \
--attn-implementation [flash_attention_2|sdpa] \
--greedy
vllm backend:
evalplus.evaluate --model "ise-uiuc/Magicoder-S-DS-6.7B" \
--dataset [humaneval|mbpp] \
--backend vllm \
--tp [TENSOR_PARALLEL_SIZE] \
--greedy
openai-compatible servers (e.g., vLLM):
evalplus.evaluate --model "ise-uiuc/Magicoder-S-DS-6.7B" \
--dataset [humaneval|mbpp] \
--backend openai \
--base-url http://localhost:8000/v1 \
--greedy
OpenAI models
export OPENAI_API_KEY="[YOUR_API_KEY]"
evalplus.evaluate --model "gpt-4o" \
--dataset [humaneval|mbpp] \
--backend openai \
--greedy
Anthropic models
export ANTHROPIC_API_KEY="[YOUR_API_KEY]"
evalplus.evaluate --model "claude-3-haiku-20240307" \
--dataset [humaneval|mbpp] \
--backend anthropic \
--greedy
Google Gemini models
export GOOGLE_API_KEY="[YOUR_API_KEY]"
evalplus.evaluate --model "gemini-1.5-pro" \
--dataset [humaneval|mbpp] \
--backend google \
--greedy
You can check out the generations and results at evalplus_results/[humaneval|mbpp]/
⏬ Using EvalPlus as a local repo?
git clone https://github.com/evalplus/evalplus.git
cd evalplus
export PYTHONPATH=$PYTHONPATH:$(pwd)
pip install -r requirements.txt
📚 Documents
To learn more about how to use EvalPlus, please refer to:
📜 Citation
@inproceedings{evalplus,
title = {Is Your Code Generated by Chat{GPT} Really Correct? Rigorous Evaluation of Large Language Models for Code Generation},
author = {Liu, Jiawei and Xia, Chunqiu Steven and Wang, Yuyao and Zhang, Lingming},
booktitle = {Thirty-seventh Conference on Neural Information Processing Systems},
year = {2023},
url = {https://openreview.net/forum?id=1qvx610Cu7},
}
@inproceedings{evalperf,
title = {Evaluating Language Models for Efficient Code Generation},
author = {Liu, Jiawei and Xie, Songrun and Wang, Junhao and Wei, Yuxiang and Ding, Yifeng and Zhang, Lingming},
booktitle = {First Conference on Language Modeling},
year = {2024},
url = {https://openreview.net/forum?id=IBCBMeAhmC},
}
🙏 Acknowledgement