
llm-jp-eval-mm automates the evaluation of multimodal large language models (vision-language models, VLMs) across various datasets, with a primary focus on Japanese tasks.
This tool supports multi-modal text generation tasks and calculates task-specific evaluation metrics based on the inference results provided by users.
You can install llm-jp-eval-mm either from GitHub or via PyPI.

From GitHub:

```bash
git clone git@github.com:llm-jp/llm-jp-eval-mm.git
cd llm-jp-eval-mm
uv sync
```

From PyPI:

```bash
pip install eval_mm
```
This tool uses the LLM-as-a-judge method for evaluation, which sends requests to GPT-4o via the OpenAI API. You need to configure the API keys in a .env file:

- For Azure OpenAI: AZURE_OPENAI_ENDPOINT and AZURE_OPENAI_KEY
- For the OpenAI API: OPENAI_API_KEY

If you are not using the LLM-as-a-judge method, you can set any placeholder value in the .env file to bypass the error.
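To confirm the keys are actually picked up before launching an evaluation, a quick check such as the following can help. This is only a sketch and not part of llm-jp-eval-mm; it assumes the python-dotenv package is installed.

```python
# Sanity check for the .env configuration (not part of llm-jp-eval-mm).
# Assumes python-dotenv is installed: pip install python-dotenv
import os
from dotenv import load_dotenv

load_dotenv()  # reads the .env file in the current directory

# Either the Azure pair or the plain OpenAI key is needed for LLM-as-a-judge metrics.
azure_ok = bool(os.getenv("AZURE_OPENAI_ENDPOINT")) and bool(os.getenv("AZURE_OPENAI_KEY"))
openai_ok = bool(os.getenv("OPENAI_API_KEY"))
print("Azure OpenAI configured:", azure_ok)
print("OpenAI API configured:", openai_ok)
```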
To evaluate your model on a specific task, we provide an example script: examples/sample.py. For example, to evaluate the llava-hf/llava-1.5-7b-hf model on the japanese-heron-bench task, run:
```bash
uv sync --group normal
uv run --group normal python examples/sample.py \
  --model_id llava-hf/llava-1.5-7b-hf \
  --task_id japanese-heron-bench \
  --result_dir result \
  --metrics "heron-bench" \
  --judge_model "gpt-4o-2024-11-20" \
  --overwrite
```
The evaluation results will be saved in the result directory:
```
result
├── japanese-heron-bench
│   ├── llava-hf
│   │   ├── llava-1.5-7b-hf
│   │   │   ├── evaluation.jsonl
│   │   │   └── prediction.jsonl
```
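Both files are plain JSON Lines, so they are easy to inspect programmatically. The snippet below is a minimal sketch: the path follows the tree above, but the field names inside each record depend on the task, so check the keys rather than assuming a schema.

```python
# Inspect saved predictions. Only the JSON Lines format and the directory
# layout shown above are assumed; field names vary by task.
import json
from pathlib import Path

pred_file = Path("result/japanese-heron-bench/llava-hf/llava-1.5-7b-hf/prediction.jsonl")
with pred_file.open(encoding="utf-8") as f:
    records = [json.loads(line) for line in f]

print(f"{len(records)} predictions")
print(records[0].keys())  # see which fields this task writes
```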
If you want to evaluate multiple models on multiple tasks, please check eval_all.sh.
You can also integrate llm-jp-eval-mm into your own code. Here's an example:
```python
from PIL import Image
from eval_mm import TaskRegistry, ScorerRegistry, ScorerConfig

# A dummy model that always answers "宮崎駿" (Hayao Miyazaki).
class MockVLM:
    def generate(self, images: list[Image.Image], text: str) -> str:
        return "宮崎駿"

# Load a task and pull the first example.
task = TaskRegistry.load_task("japanese-heron-bench")
example = task.dataset[0]
input_text = task.doc_to_text(example)
images = task.doc_to_visual(example)
reference = task.doc_to_answer(example)

# Run the model on the example.
model = MockVLM()
prediction = model.generate(images, input_text)

# Score the prediction against the reference with ROUGE-L.
scorer = ScorerRegistry.load_scorer(
    "rougel",
    ScorerConfig(docs=task.dataset),
)
result = scorer.aggregate(scorer.score([reference], [prediction]))
print(result)
# AggregateOutput(overall_score=5.128205128205128, details={'rougel': 5.128205128205128})
```
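Building on the same calls, the sketch below scores every example in a task instead of a single one. It only assumes the API shown above (doc_to_text, doc_to_visual, doc_to_answer, generate, score, aggregate); the library may offer a more direct batch interface.

```python
# Hypothetical batch evaluation composed from the calls shown above.
from eval_mm import TaskRegistry, ScorerRegistry, ScorerConfig

task = TaskRegistry.load_task("japanese-heron-bench")
scorer = ScorerRegistry.load_scorer("rougel", ScorerConfig(docs=task.dataset))
model = MockVLM()  # from the example above; replace with your own model wrapper

references, predictions = [], []
for example in task.dataset:
    references.append(task.doc_to_answer(example))
    predictions.append(
        model.generate(task.doc_to_visual(example), task.doc_to_text(example))
    )

# score() takes parallel lists of references and predictions;
# aggregate() reduces the per-example scores to an overall result.
result = scorer.aggregate(scorer.score(references, predictions))
print(result)
```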
To generate a leaderboard from your evaluation results, run:
```bash
python scripts/make_leaderboard.py --result_dir result
```
This will create a leaderboard.md file with your model's performance:
| Model | Heron/LLM | JVB-ItW/LLM | JVB-ItW/Rouge |
|---|---|---|---|
| llm-jp/llm-jp-3-vila-14b | 68.03 | 4.08 | 52.4 |
| Qwen/Qwen2.5-VL-7B-Instruct | 70.29 | 4.28 | 29.63 |
| google/gemma-3-27b-it | 69.15 | 4.36 | 30.89 |
| microsoft/Phi-4-multimodal-instruct | 45.52 | 3.2 | 26.8 |
| gpt-4o-2024-11-20 | 93.7 | 4.44 | 32.2 |

(Heron: Japanese Heron Bench, LLM-judge score; JVB-ItW: JA-VLM-Bench-In-the-Wild, LLM-judge and ROUGE-L scores.)
The official leaderboard is available on the project website.
Currently, the following benchmark tasks are supported:
Japanese Tasks:
English Tasks:
Each VLM may have different dependencies. To manage these, llm-jp-eval-mm uses uv's dependency groups.
For example, to use llm-jp/llm-jp-3-vila-14b, run:

```bash
uv sync --group vilaja
uv run --group vilaja python examples/VILA_ja.py
```
Refer to eval_all.sh for a full list of model dependencies.
When you add a new group, don't forget to configure the corresponding conflict settings (uv's conflicts table in pyproject.toml) so that incompatible groups are not resolved together.
For the JIC-VQA dataset, you need to download images from URLs. Use the following script to prepare the dataset:
```bash
python scripts/prepare_jic_vqa.py
```
Visualize your model’s predictions with the following Streamlit app:
```bash
uv run streamlit run scripts/browse_prediction.py --task_id "japanese-heron-bench" --result_dir "result"
```
You will then be able to browse the visualized predictions in your browser.
We welcome contributions! If you encounter issues, or if you have suggestions or improvements, please open an issue or submit a pull request.
Refer to the src/eval_mm/tasks directory to implement new benchmark tasks.
To add new metrics, implement them in the Scorer class. The code for existing scorers can be found in src/eval_mm/metrics.
Implement the inference code for VLMs in the VLM class. For reference, check examples/base_vlm.py.
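As an illustration, here is a minimal wrapper with the same generate(images, text) signature as the MockVLM shown earlier, using the llava-hf/llava-1.5-7b-hf checkpoint mentioned above via Hugging Face Transformers. This is only a sketch: the actual base class and helper methods expected in examples/base_vlm.py may differ, and the prompt format is model-specific.

```python
# Sketch of a custom VLM wrapper (the real base class in examples/base_vlm.py may differ).
from PIL import Image
import torch
from transformers import AutoProcessor, LlavaForConditionalGeneration

class LlavaVLM:
    def __init__(self, model_id: str = "llava-hf/llava-1.5-7b-hf"):
        self.processor = AutoProcessor.from_pretrained(model_id)
        self.model = LlavaForConditionalGeneration.from_pretrained(
            model_id, torch_dtype=torch.float16, device_map="auto"
        )

    def generate(self, images: list[Image.Image], text: str) -> str:
        # llava-1.5 expects one <image> placeholder per image in the prompt.
        prompt = "USER: " + "<image>\n" * len(images) + f"{text}\nASSISTANT:"
        inputs = self.processor(images=images, text=prompt, return_tensors="pt")
        inputs = inputs.to(self.model.device, torch.float16)
        output = self.model.generate(**inputs, max_new_tokens=256)
        decoded = self.processor.decode(output[0], skip_special_tokens=True)
        # Return only the assistant's answer, not the echoed prompt.
        return decoded.split("ASSISTANT:")[-1].strip()
```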
To add a new dependency, run one of the following (use --group to add it to a specific dependency group):

```bash
uv add <package_name>
uv add --group <group_name> <package_name>
```
Run the following commands to test the task classes and metrics, and to test the VLM models:

```bash
bash test.sh
bash test_model.sh
```

To format and lint the code, run:

```bash
uv run ruff format src
uv run ruff check --fix src
```
To release a new version to PyPI:

```bash
git tag -a v0.x.x -m "version 0.x.x"
git push origin --tags
```
For website updates, refer to the github_pages/README.md.
We also thank the developers of the evaluation datasets for their hard work.