eval-mm

eval-mm is a tool for evaluating Multi-Modal Large Language Models.


LLM-jp-eval-mm


This tool automatically evaluates Japanese multi-modal large language models across multiple datasets. It offers the following features:

  • Uses existing Japanese evaluation data and converts it into multi-modal text generation tasks for evaluation.
  • Calculates task-specific evaluation metrics using inference results created by users.

(Figure: What llm-jp-eval-mm provides)

For details on the data format and the list of supported data, please check DATASET.md.


Environment Setup

You can use this tool either by installing it from PyPI or by cloning the repository.

Install via PyPI

  1. Use pip to install eval_mm into the virtual environment where you want to run it:

    pip install eval_mm
    
  2. This tool uses the LLM-as-a-judge method for evaluation, which sends requests to GPT-4o via the OpenAI API. Please create a .env file and set AZURE_OPENAI_ENDPOINT and AZURE_OPENAI_KEY if you’re using Azure, or OPENAI_API_KEY if you’re using the OpenAI API.
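
For reference, a minimal .env might look like the following sketch; the values are placeholders, and you only need the variables for the provider you actually use:

    # .env -- placeholder values, not real credentials
    # If using Azure OpenAI:
    AZURE_OPENAI_ENDPOINT=https://your-resource.openai.azure.com/
    AZURE_OPENAI_KEY=your-azure-key
    # If using the OpenAI API instead:
    OPENAI_API_KEY=sk-your-openai-key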

That’s it for environment setup.

If you prefer to clone the repository and use it, please follow the instructions below.

Clone the GitHub Repo

eval-mm uses uv to manage virtual environments.

  1. Clone the repository and move into it:

    git clone git@github.com:llm-jp/llm-jp-eval-mm.git
    cd llm-jp-eval-mm
    
  2. Build the environment with uv.

    Please install uv by referring to the official doc.

    cd llm-jp-eval-mm
    uv sync
    
  3. Following the provided .env.sample, create a .env file and set AZURE_OPENAI_ENDPOINT and AZURE_OPENAI_KEY, or OPENAI_API_KEY.

That’s all you need for the setup.

How to Evaluate

Running an Evaluation

(Currently, the llm-jp-eval-mm repository is private. You can download the examples directory from the Source Distribution at https://pypi.org/project/eval-mm/#files.)

We provide sample code, examples/sample.py, for running an evaluation.

For the models listed as examples/{model_name}.py, only their inference code is provided.

To evaluate a new model or a new inference method, create a similar file based on an existing examples/{model_name}.py; you can then run the evaluation in the same way.

For example, if you want to evaluate the llava-hf/llava-1.5-7b-hf model on the japanese-heron-bench task, run the following command:

uv sync --group normal
uv run --group normal python examples/sample.py \
  --model_id llava-hf/llava-1.5-7b-hf \
  --task_id japanese-heron-bench  \
  --result_dir test  \
  --metrics "llm_as_a_judge_heron_bench" \
  --judge_model "gpt-4o-2024-05-13" \
  --overwrite

The evaluation score and output results will be saved in test/{task_id}/evaluation/{model_id}.jsonl and test/{task_id}/prediction/{model_id}.jsonl.
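
If you want to inspect the saved predictions programmatically, a minimal sketch looks like the following; the path is illustrative, and the exact file layout and record fields may differ from your run:

import json

# Illustrative path -- substitute your own result_dir, task_id, and model_id.
path = "test/japanese-heron-bench/prediction/llava-hf/llava-1.5-7b-hf.jsonl"

with open(path) as f:
    records = [json.loads(line) for line in f]  # one JSON object per line

print(f"loaded {len(records)} prediction records")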

If you want to evaluate multiple models on multiple tasks, please check eval_all.sh.

Leaderboard

We plan to publish a leaderboard summarizing the evaluation results of major models soon.

Supported Tasks

Right now, the following benchmark tasks are supported (see DATASET.md for the full list and details):

Japanese tasks: including japanese-heron-bench, JDocQA, and JMMMU.

English tasks: including MMMU.

Required Libraries for VLM Model Inference

Different models require different libraries. In this repository, we use uv's dependency groups to manage the libraries needed for each model.

When using the following models, please specify the normal group: stabilityai/japanese-instructblip-alpha, stabilityai/japanese-stable-vlm, cyberagent/llava-calm2-siglip, llava-hf/llava-1.5-7b-hf, llava-hf/llava-v1.6-mistral-7b-hf, neulab/Pangea-7B-hf, meta-llama/Llama-3.2-11B-Vision-Instruct, meta-llama/Llama-3.2-90B-Vision-Instruct, OpenGVLab/InternVL2-8B, Qwen/Qwen2-VL-7B-Instruct, OpenGVLab/InternVL2-26B, Qwen/Qwen2-VL-72B-Instruct, gpt-4o-2024-05-13

uv sync --group normal

When using the following model, please specify the evovlm group: SakanaAI/Llama-3-EvoVLM-JP-v2

uv sync --group evovlm

When using the following models, please specify the vilaja group: llm-jp/llm-jp-3-vila-14b, Efficient-Large-Model/VILA1.5-13b

uv sync --group vilaja

When using the following model, please specify the pixtral group: mistralai/Pixtral-12B-2409

uv sync --group pixtral

When running the script, make sure to specify the group:

$ uv run --group normal python ...

If you add a new group, don't forget to configure its conflicts as well.
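
As a rough sketch of what that configuration looks like (the group names here are only examples; see the uv documentation on conflicting dependencies for the authoritative syntax), conflicts are declared in pyproject.toml:

# pyproject.toml -- example only; declare groups that must not be installed together
[tool.uv]
conflicts = [
    [
        { group = "normal" },
        { group = "evovlm" },
    ],
]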

Benchmark-Specific Required Libraries

  • JDocQA: To construct the JDocQA dataset, you need the pdf2image library. Since pdf2image depends on poppler-utils, install it with the command below (a short sanity-check snippet follows):

    sudo apt-get install poppler-utils
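
Once poppler-utils and pdf2image are installed, a quick sanity check (an illustrative snippet, not part of the benchmark code) is:

    # Requires poppler-utils on the system and `pip install pdf2image`.
    from pdf2image import convert_from_path

    pages = convert_from_path("sample.pdf")  # one PIL Image per PDF page
    print(f"converted {len(pages)} page(s)")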
    

License

This repository is licensed under the Apache-2.0 License. For the licenses of each evaluation dataset, please see DATASET.md.

Contribution

  • If you find any issues or have suggestions, please report them on the Issue tracker.
  • If you add new benchmark tasks, metrics, or VLM model inference code, or if you fix bugs, please send us a Pull Request.

How to Add a Benchmark Task

Tasks are defined in the Task class. Please reference the code in src/eval_mm/tasks and implement your Task class. You’ll need methods to convert the dataset into a format for input to the VLM model, and methods to calculate the score.
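
As a rough, hypothetical sketch of the shape such a class takes (the method names below are illustrative, not the repository's actual interface; follow src/eval_mm/tasks for the real one):

# Hypothetical sketch -- see src/eval_mm/tasks for the actual Task interface.
class MyBenchmarkTask:
    """Converts a dataset into VLM inputs and scores the outputs."""

    def __init__(self, dataset):
        self.dataset = dataset

    def to_vlm_input(self, example: dict):
        # Build the (image, prompt) pair the VLM will receive.
        return example["image"], example["question"]

    def compute_score(self, reference: str, prediction: str) -> float:
        # Task-specific metric; exact match is shown only as a placeholder.
        return float(reference.strip() == prediction.strip())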

How to Add a Metric

Metrics are defined in the Scorer class. Please reference the code in src/eval_mm/metrics and implement your Scorer class. You’ll need to implement a score() method for sample-level scoring comparing references and generated outputs, and an aggregate() method for population-level metric calculation.
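
A minimal, hypothetical sketch (the signatures in the real base class under src/eval_mm/metrics may differ):

# Hypothetical sketch -- see src/eval_mm/metrics for the actual Scorer interface.
class ExactMatchScorer:
    def score(self, refs: list[str], preds: list[str]) -> list[float]:
        # Sample-level scores comparing references with generated outputs.
        return [float(r.strip() == p.strip()) for r, p in zip(refs, preds)]

    def aggregate(self, scores: list[float]) -> float:
        # Population-level metric: here, simply the mean of the sample scores.
        return sum(scores) / len(scores) if scores else 0.0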

How to Add Inference Code for a VLM Model

Inference code for VLM models is defined in the VLM class. Please reference examples/base_vlm and implement your VLM class. You’ll need a generate() method to produce output text from images and prompts.
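
A hypothetical sketch using Hugging Face transformers (examples/base_vlm defines the real interface; the model, processor, and method signature here are assumptions for illustration):

# Hypothetical sketch -- mirror examples/base_vlm for the actual interface.
from transformers import AutoProcessor, LlavaForConditionalGeneration


class MyVLM:
    def __init__(self, model_id: str = "llava-hf/llava-1.5-7b-hf"):
        self.processor = AutoProcessor.from_pretrained(model_id)
        self.model = LlavaForConditionalGeneration.from_pretrained(model_id)

    def generate(self, images, prompt: str) -> str:
        # Produce output text from images and a prompt.
        inputs = self.processor(images=images, text=prompt, return_tensors="pt")
        output_ids = self.model.generate(**inputs, max_new_tokens=256)
        return self.processor.batch_decode(output_ids, skip_special_tokens=True)[0]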

How to Add Dependencies

uv add <package_name>
uv add --group <group_name> <package_name>

Formatting and Linting with ruff

uv run ruff format src
uv run ruff check --fix src

How to Release to PyPI

git tag -a v0.x.x -m "version 0.x.x"
git push origin --tags

Or you can manually create a new release on GitHub.

How to Update the Website

Please refer to github_pages/README.md.

Acknowledgements

  • Heron: We refer to the Heron code for the evaluation of the Japanese Heron Bench task.
  • lmms-eval: We refer to the lmms-eval code for the evaluation of the JMMMU and MMMU tasks.

We also thank the developers of the evaluation datasets for their hard work.
