
💥 Impact • 📰 News • 🔥 Quick Start • 🚀 Remote Evaluation • 💻 LLM-generated Code • 🧑 Advanced Usage • 📰 Result Submission • 📜 Citation
BigCodeBench has been trusted by many LLM teams.
📰 News
- Released bigcodebench==v0.2.2.dev2, with 163 models evaluated!
- Released bigcodebench==v0.2.0.
- Released bigcodebench==v0.1.9.
- Released bigcodebench==v0.1.8.
- Released bigcodebench==v0.1.7.
- Released bigcodebench==v0.1.6.
- Released bigcodebench==v0.1.5.
BigCodeBench is an easy-to-use benchmark for solving practical and challenging tasks via code. It aims to evaluate the true programming capabilities of large language models (LLMs) in a more realistic setting. The benchmark is designed for HumanEval-like function-level code generation tasks, but with much more complex instructions and diverse function calls.
There are two splits in BigCodeBench:
- Complete: This split is designed for code completion based on the comprehensive docstrings.
- Instruct: This split works for instruction-tuned and chat models only, where the models are asked to generate a code snippet based on natural language instructions. The instructions contain only the necessary information and require more complex reasoning.
BigCodeBench focuses on task automation via code generation with diverse function calls and complex instructions.
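If you want to inspect the Complete and Instruct prompts yourself, the task data can be pulled from the Hugging Face Hub. The sketch below is illustrative and assumes the dataset is published as bigcode/bigcodebench; check the dataset card for the exact file layout and field names.
# Illustrative: download the dataset locally to browse the Complete vs. Instruct prompts.
# Assumption: the dataset lives at bigcode/bigcodebench on the Hugging Face Hub.
pip install -U "huggingface_hub[cli]"
huggingface-cli download bigcode/bigcodebench --repo-type dataset --local-dir ./bigcodebench_data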
To get started, please first set up the environment:
# By default, you will use the remote evaluation API to execute the output samples.
pip install bigcodebench --upgrade
# We suggest installing `flash-attn` for generating code samples.
pip install packaging ninja
pip install flash-attn --no-build-isolation
# Note: if you have installation problems, consider using pre-built
# wheels from https://github.com/Dao-AILab/flash-attention/releases
# Install to use bigcodebench.generate
pip install "git+https://github.com/bigcode-project/bigcodebench.git" --upgrade
We use greedy decoding as an example to show how to evaluate the generated code samples via the remote API.
[!Warning]
To ease generation, we use batch inference by default. However, batch inference results can vary across batch sizes and backend versions, at least for the vLLM backend. If you want more deterministic results for greedy decoding, please set --bs to 1.
[!Note]
- The gradio backend typically takes 6-7 minutes on BigCodeBench-Full and 4-5 minutes on BigCodeBench-Hard.
- The e2b backend with the default machine typically takes 25-30 minutes on BigCodeBench-Full and 15-20 minutes on BigCodeBench-Hard.
bigcodebench.evaluate \
--model meta-llama/Meta-Llama-3.1-8B-Instruct \
--execution [e2b|gradio|local] \
--split [complete|instruct] \
--subset [full|hard] \
--backend [vllm|openai|anthropic|google|mistral|hf|hf-inference]
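A concrete invocation might look like the sketch below; the execution mode, split, and subset are illustrative choices, and --bs 1 follows the determinism warning above:
# Illustrative run: greedy decoding on the Hard subset of the Complete split,
# executed remotely via the gradio backend, with batch size pinned to 1.
bigcodebench.evaluate \
  --model meta-llama/Meta-Llama-3.1-8B-Instruct \
  --execution gradio \
  --split complete \
  --subset hard \
  --backend vllm \
  --bs 1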
All the resulting files will be stored in a folder named bcb_results:
- The generated code samples are stored in [model_name]--bigcodebench-[instruct|complete]--[backend]-[temp]-[n_samples]-sanitized_calibrated.jsonl.
- The evaluation results are stored in [model_name]--bigcodebench-[instruct|complete]--[backend]-[temp]-[n_samples]-sanitized_calibrated_eval_results.json.
- The pass@k results are stored in [model_name]--bigcodebench-[instruct|complete]--[backend]-[temp]-[n_samples]-sanitized_calibrated_pass_at_k.json.
[!Note]
The gradio backend is hosted on a Hugging Face Space by default. The default Space can sometimes be slow, so we recommend using the gradio backend with a cloned bigcodebench-evaluator endpoint for faster evaluation. Alternatively, you can use the e2b sandbox for evaluation, which is also fairly slow on the default machine.
[!Note]
BigCodeBench uses different prompts for base and chat models. By default, the model type is detected via tokenizer.chat_template when using hf/vllm as the backend. For other backends, only chat mode is allowed.
Therefore, if your base model comes with a tokenizer.chat_template, please add --direct_completion to avoid being evaluated in chat mode.
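For instance, a base model whose tokenizer happens to ship a chat template could be evaluated in completion mode roughly like this (the model name is a placeholder):
# Illustrative: force direct-completion prompts for a base model whose
# tokenizer ships a chat template.
bigcodebench.evaluate \
  --model your-org/your-base-model \
  --execution gradio \
  --split complete \
  --subset full \
  --backend vllm \
  --direct_completion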
To use E2B, you need to set up an account and get an API key from E2B.
export E2B_API_KEY=<your_e2b_api_key>
Access OpenAI APIs from OpenAI Console
export OPENAI_API_KEY=<your_openai_api_key>
Access Anthropic APIs from Anthropic Console
export ANTHROPIC_API_KEY=<your_anthropic_api_key>
Access Mistral APIs from Mistral Console
export MISTRAL_API_KEY=<your_mistral_api_key>
Access Gemini APIs from Google AI Studio
export GOOGLE_API_KEY=<your_google_api_key>
Access the Hugging Face Serverless Inference API
export HF_INFERENCE_API_KEY=<your_hf_api_key>
Please make sure your HF access token has the "Make calls to inference providers" permission.
We share pre-generated code samples from LLMs we have evaluated on the full set (sanitized_samples_calibrated.zip) for your convenience.
Please refer to the ADVANCED USAGE for more details.
Please email both the generated code samples and the execution results to terry.zhuo@monash.edu if you would like to contribute your model to the leaderboard. Note that the file names should be in the format of [model_name]--[revision]--[bigcodebench|bigcodebench-hard]-[instruct|complete]--[backend]-[temp]-[n_samples]-sanitized_calibrated.jsonl
and [model_name]--[revision]--[bigcodebench|bigcodebench-hard]-[instruct|complete]--[backend]-[temp]-[n_samples]-sanitized_calibrated_eval_results.json
. You can file an issue to remind us if we do not respond to your email within 3 days.
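For example, a submission for a hypothetical model and revision, generated with greedy decoding (temperature 0.0, one sample per task), would consist of a pair of files named like this:
# Hypothetical file names, following the pattern above:
your-model--main--bigcodebench-complete--vllm-0.0-1-sanitized_calibrated.jsonl
your-model--main--bigcodebench-complete--vllm-0.0-1-sanitized_calibrated_eval_results.json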
@article{zhuo2024bigcodebench,
title={BigCodeBench: Benchmarking Code Generation with Diverse Function Calls and Complex Instructions},
author={Zhuo, Terry Yue and Vu, Minh Chien and Chim, Jenny and Hu, Han and Yu, Wenhao and Widyasari, Ratnadira and Yusuf, Imam Nur Bani and Zhan, Haolan and He, Junda and Paul, Indraneil and others},
journal={arXiv preprint arXiv:2406.15877},
year={2024}
}