github.com/evolvinglmms-lab/lmms-eval
Accelerating the development of large multimodal models (LMMs) with lmms-eval. We support most text, image, video and audio tasks.
🏠 LMMs-Lab Homepage | 🤗 Huggingface Datasets | discord/lmms-eval
📖 Supported Tasks (90+) | 🌟 Supported Models (30+) | 📚 Documentation
We warmly welcome contributions from the open-source community!
vllm has been integrated into our models, enabling accelerated evaluation for both multimodal and language models. Additionally, we have incorporated openai_compatible to support the evaluation of any API-based model that follows the OpenAI API format. Check the usages here.
lmms-eval/v0.3.0 has been upgraded to support audio evaluations for audio models like Qwen2-Audio and Gemini-Audio across tasks such as AIR-Bench, Clotho-AQA, LibriSpeech, and more. Please refer to the blog for more details!
lmms-eval has been upgraded to 0.2.3 with more tasks and features. We support a compact set of language task evaluations (code credit to lm-evaluation-harness), and we removed the registration logic at startup (for all models and tasks) to reduce overhead. Now lmms-eval only launches the necessary tasks/models. Please check the release notes for more details.
lmms-eval/v0.2.1 has been upgraded to support more models, including LongVA, InternVL-2, VILA, and many more evaluation tasks, e.g. Details Captions, MLVU, WildVision-Bench, VITATECS and LLaVA-Interleave-Bench.
lmms-eval/v0.2.0 has been upgraded to support video evaluations for video models like LLaVA-NeXT Video and Gemini 1.5 Pro across tasks such as EgoSchema, PerceptionTest, VideoMME, and more. Please refer to the blog for more details!
lmms-eval, please refer to the blog for more details!
Why lmms-eval?
We're on an exciting journey toward creating Artificial General Intelligence (AGI), much like the enthusiasm of the 1960s moon landing. This journey is powered by advanced large language models (LLMs) and large multimodal models (LMMs), which are complex systems capable of understanding, learning, and performing a wide variety of human tasks.
To gauge how advanced these models are, we use a variety of evaluation benchmarks. These benchmarks are tools that help us understand the capabilities of these models, showing us how close we are to achieving AGI. However, finding and using these benchmarks is a big challenge. The necessary benchmarks and datasets are spread out and hidden in various places like Google Drive, Dropbox, and different school and research lab websites. It feels like we're on a treasure hunt, but the maps are scattered everywhere.
In the field of language models, a valuable precedent has been set by lm-evaluation-harness. It offers integrated data and model interfaces, enables rapid evaluation of language models, serves as the backend support framework for the open-llm-leaderboard, and has gradually become the underlying ecosystem of the era of foundation models.
We humbly absorbed the exquisite and efficient design of lm-evaluation-harness and introduce lmms-eval, an evaluation framework meticulously crafted for consistent and efficient evaluation of LMMs.
For direct usage, you can install the package from Git by running the following command:
curl -LsSf https://astral.sh/uv/install.sh | sh
uv venv eval --python 3.12
source eval/bin/activate
uv pip install git+https://github.com/EvolvingLMMs-Lab/lmms-eval.git
For development, you can install the package by cloning the repository and running the following command:
git clone https://github.com/EvolvingLMMs-Lab/lmms-eval
cd lmms-eval
uv venv dev
source dev/bin/activate
uv pip install -e .
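After either install path, a quick check can confirm that the package is visible in the active virtual environment. The following is a minimal sketch, assuming the importable module is named lmms_eval (the name used by python3 -m lmms_eval); the helper is_importable is a hypothetical name introduced for illustration and is not part of lmms-eval.

```python
import importlib.util


def is_importable(module_name: str) -> bool:
    """Return True if a top-level module can be located in the active environment."""
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    # Assumption: the package installs a top-level module called `lmms_eval`.
    status = "found" if is_importable("lmms_eval") else "not found"
    print(f"lmms_eval: {status}")
```

If the module is reported as not found, the most likely cause is that the virtual environment created above has not been activated in the current shell.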
You can check the environment install script and torch environment info to reproduce LLaVA-1.5's paper results. We found that differences in torch/CUDA versions cause small variations in the results, so we provide a results check across different environments.
If you want to test on caption datasets such as coco, refcoco, and nocaps, you will need java==1.8.0 for the pycocoeval API to work. If you don't have it, you can install it using conda:
conda install openjdk=8
You can then check your Java version with java -version.
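The Java check can also be scripted before a caption evaluation is launched. The sketch below is illustrative only: parse_java_major and detect_java_major are hypothetical helper names, not part of lmms-eval, and the parser assumes the usual version "..." string that java -version prints (normally to stderr).

```python
import re
import shutil
import subprocess
from typing import Optional


def parse_java_major(version_output: str) -> Optional[int]:
    """Extract the major Java version from `java -version` output, or None."""
    match = re.search(r'version "(\d+)(?:\.(\d+))?', version_output)
    if match is None:
        return None
    first = int(match.group(1))
    # Legacy numbering: "1.8.0_292" denotes Java 8.
    if first == 1 and match.group(2) is not None:
        return int(match.group(2))
    return first


def detect_java_major() -> Optional[int]:
    """Run `java -version` if java is on PATH and return its major version."""
    if shutil.which("java") is None:
        return None
    result = subprocess.run(
        ["java", "-version"], capture_output=True, text=True, check=False
    )
    # `java -version` normally writes to stderr; stdout is included as a fallback.
    return parse_java_major(result.stderr + result.stdout)


if __name__ == "__main__":
    major = detect_java_major()
    if major == 8:
        print("Java 8 detected")
    else:
        print(f"Java 8 not detected (found: {major})")
```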
As demonstrated by the extensive table below, we aim to provide detailed information for readers to understand the datasets included in lmms-eval and some specific details about these datasets (we remain grateful for any corrections readers may have during our evaluation process).
We provide a Google Sheet for the detailed results of the LLaVA series models on different datasets. You can access the sheet here. It's a live sheet, and we are updating it with new results.
We also provide the raw data exported from Weights & Biases for the detailed results of the LLaVA series models on different datasets. You can access the raw data here.
If you want to test VILA, you should install the following dependencies:
pip install s2wrapper@git+https://github.com/bfshi/scaling_on_scales
Development will continue on the main branch, and we encourage you to give us feedback on what features are desired and how to improve the library further, or to ask questions, either in issues or PRs on GitHub.
More examples can be found in examples/models
Evaluation of OpenAI-Compatible Model
bash examples/models/openai_compatible.sh
bash examples/models/xai_grok.sh
Evaluation of vLLM
bash examples/models/vllm_qwen2vl.sh
Evaluation of LLaVA-OneVision
bash examples/models/llava_onevision.sh
Evaluation of LLaMA-3.2-Vision
bash examples/models/llama_vision.sh
Evaluation of Qwen2-VL
bash examples/models/qwen2_vl.sh
bash examples/models/qwen2_5_vl.sh
Evaluation of LLaVA on MME
If you want to test LLaVA 1.5, you will have to clone their repo from LLaVA and
bash examples/models/llava_next.sh
Evaluation with tensor parallel for bigger model (llava-next-72b)
bash examples/models/tensor_parallel.sh
Evaluation with SGLang for bigger model (llava-next-72b)
bash examples/models/sglang.sh
Evaluation with vLLM for bigger model (llava-next-72b)
bash examples/models/vllm_qwen2vl.sh
More Parameters
python3 -m lmms_eval --help
Environment Variables
Before running experiments and evaluations, we recommend exporting the following environment variables. Some are necessary for certain tasks to run.
export OPENAI_API_KEY="<YOUR_API_KEY>"
export HF_HOME="<Path to HF cache>"
export HF_TOKEN="<YOUR_API_KEY>"
export HF_HUB_ENABLE_HF_TRANSFER="1"
export REKA_API_KEY="<YOUR_API_KEY>"
# Other possible environment variables include
# ANTHROPIC_API_KEY,DASHSCOPE_API_KEY etc.
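Because a missing variable may only surface partway through a run, it can help to check for the expected variables up front. This is a minimal sketch, not part of lmms-eval: missing_env_vars is a hypothetical helper, and the list of names checked is only an example drawn from the exports above, since which variables are actually required depends on the tasks and models being run.

```python
import os
from typing import Iterable, List, Mapping, Optional


def missing_env_vars(
    names: Iterable[str], env: Optional[Mapping[str, str]] = None
) -> List[str]:
    """Return the names that are unset or empty in `env` (default: os.environ)."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


if __name__ == "__main__":
    # Example list only; adjust it to the tasks and models you plan to evaluate.
    wanted = ["OPENAI_API_KEY", "HF_HOME", "HF_TOKEN"]
    missing = missing_env_vars(wanted)
    if missing:
        print("Not set:", ", ".join(missing))
    else:
        print("All listed variables are set.")
```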
Common Environment Issues
You might sometimes encounter common issues, for example errors related to httpx or protobuf. To solve these issues, you can first try
python3 -m pip install httpx==0.23.3;
python3 -m pip install protobuf==3.20;
# numpy==2.x can sometimes cause errors; if so, pin an older release
python3 -m pip install numpy==1.26;
# sentencepiece is sometimes required for the tokenizer to work
python3 -m pip install sentencepiece;
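Before pinning anything, it can be useful to see which versions of these packages are currently installed. The sketch below uses the standard-library importlib.metadata module; installed_versions is a hypothetical helper introduced for illustration, and the package list simply mirrors the commands above.

```python
from importlib import metadata
from typing import Dict, Iterable, Optional


def installed_versions(packages: Iterable[str]) -> Dict[str, Optional[str]]:
    """Map each distribution name to its installed version, or None if absent."""
    versions: Dict[str, Optional[str]] = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    names = ["httpx", "protobuf", "numpy", "sentencepiece"]
    for name, version in installed_versions(names).items():
        print(f"{name}: {version or 'not installed'}")
```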
Please refer to our documentation.
lmms_eval is a fork of lm-eval-harness. We recommend reading through the lm-eval-harness docs for relevant information.
Below are the changes we made to the original API:
@misc{zhang2024lmmsevalrealitycheckevaluation,
title={LMMs-Eval: Reality Check on the Evaluation of Large Multimodal Models},
author={Kaichen Zhang and Bo Li and Peiyuan Zhang and Fanyi Pu and Joshua Adrian Cahyono and Kairui Hu and Shuai Liu and Yuanhan Zhang and Jingkang Yang and Chunyuan Li and Ziwei Liu},
year={2024},
eprint={2407.12772},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2407.12772},
}
@misc{lmms_eval2024,
title={LMMs-Eval: Accelerating the Development of Large Multimodal Models},
url={https://github.com/EvolvingLMMs-Lab/lmms-eval},
author={Bo Li*, Peiyuan Zhang*, Kaichen Zhang*, Fanyi Pu*, Xinrun Du, Yuhao Dong, Haotian Liu, Yuanhan Zhang, Ge Zhang, Chunyuan Li and Ziwei Liu},
publisher = {Zenodo},
version = {v0.1.0},
month={March},
year={2024}
}