eval-mm 0.4.0 (PyPI)

eval-mm is a tool for evaluating Multi-Modal Large Language Models.

llm-jp-eval-mm


[ Japanese | English ]

llm-jp-eval-mm automates the evaluation of multi-modal large language models (VLMs) across various datasets, mainly focusing on Japanese tasks.

This tool supports multi-modal text generation tasks and calculates task-specific evaluation metrics based on the inference results provided by users.

Getting Started

You can install llm-jp-eval-mm from GitHub or via PyPI.

  • Option 1: Clone from GitHub (Recommended)
git clone git@github.com:llm-jp/llm-jp-eval-mm.git
cd llm-jp-eval-mm
uv sync
  • Option 2: Install via PyPI
pip install eval_mm

This tool uses the LLM-as-a-judge method for evaluation, which sends requests to GPT-4o via the OpenAI API. You need to configure the API keys in a .env file:

  • For Azure: AZURE_OPENAI_ENDPOINT and AZURE_OPENAI_KEY
  • For OpenAI: OPENAI_API_KEY

If you're not using the LLM-as-a-judge method, you can set dummy values for these variables in the .env file to bypass the error.
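
For example, a minimal .env for the OpenAI backend might look like the following sketch (the values are placeholders):

OPENAI_API_KEY=sk-...
# For Azure, set these two instead:
# AZURE_OPENAI_ENDPOINT=https://<your-resource>.openai.azure.com/
# AZURE_OPENAI_KEY=<your-key>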

How to Evaluate

Running an Evaluation

To evaluate your model on a specific task, we provide an example script: examples/sample.py.

For example, to evaluate the llava-hf/llava-1.5-7b-hf model on the japanese-heron-bench task, run:

uv sync --group normal
uv run --group normal python examples/sample.py \
  --model_id llava-hf/llava-1.5-7b-hf \
  --task_id japanese-heron-bench  \
  --result_dir result  \
  --metrics "heron-bench" \
  --judge_model "gpt-4o-2024-11-20" \
  --overwrite

The evaluation results will be saved in the result directory:

├── japanese-heron-bench
│   ├── llava-hf
│   │   ├── llava-1.5-7b-hf
│   │   │   ├── evaluation.jsonl
│   │   │   └── prediction.jsonl
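
Both files are in JSON Lines format, so you can inspect them with a few lines of Python. A minimal sketch (it makes no assumptions about the exact field names):

import json

path = "result/japanese-heron-bench/llava-hf/llava-1.5-7b-hf/prediction.jsonl"
with open(path) as f:
    for line in f:
        record = json.loads(line)      # one prediction per line
        print(sorted(record.keys()))   # see which fields are present
        break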

If you want to evaluate multiple models on multiple tasks, please check eval_all.sh.

Use llm-jp-eval-mm as a Library

You can also integrate llm-jp-eval-mm into your own code. Here's an example:

from PIL import Image
from eval_mm import TaskRegistry, ScorerRegistry, ScorerConfig

# Minimal stand-in model: it ignores its inputs and always answers "宮崎駿" (Hayao Miyazaki).
class MockVLM:
    def generate(self, images: list[Image.Image], text: str) -> str:
        return "宮崎駿"

task = TaskRegistry.load_task("japanese-heron-bench")
example = task.dataset[0]

input_text = task.doc_to_text(example)
images = task.doc_to_visual(example)
reference = task.doc_to_answer(example)

model = MockVLM()
prediction = model.generate(images, input_text)

scorer = ScorerRegistry.load_scorer(
    "rougel",
    ScorerConfig(docs=task.dataset)
)
result = scorer.aggregate(scorer.score([reference], [prediction]))
print(result)
# AggregateOutput(overall_score=5.128205128205128, details={'rougel': 5.128205128205128})

Leaderboard

To generate a leaderboard from your evaluation results, run:

python scripts/make_leaderboard.py --result_dir result

This will create a leaderboard.md file with your model performance:

| Model                               | Heron/LLM | JVB-ItW/LLM | JVB-ItW/Rouge |
| ----------------------------------- | --------- | ----------- | ------------- |
| llm-jp/llm-jp-3-vila-14b            | 68.03     | 4.08        | 52.4          |
| Qwen/Qwen2.5-VL-7B-Instruct         | 70.29     | 4.28        | 29.63         |
| google/gemma-3-27b-it               | 69.15     | 4.36        | 30.89         |
| microsoft/Phi-4-multimodal-instruct | 45.52     | 3.2         | 26.8          |
| gpt-4o-2024-11-20                   | 93.7      | 4.44        | 32.2          |

The official leaderboard is available here

Supported Tasks

Currently, the following benchmark tasks are supported:

Japanese Tasks:

English Tasks:

Required Libraries for Each VLM Model Inference

Each VLM model may have different dependencies. To manage these, llm-jp-eval-mm uses uv's dependency groups.

For example, to use llm-jp/llm-jp-3-vila-14b, run:

uv sync --group vilaja
uv run --group vilaja python examples/VILA_ja.py

Refer to eval_all.sh for a full list of model dependencies.

When you add a new group, don't forget to declare it in uv's conflict configuration (the conflicts setting in pyproject.toml).
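
As an illustration, a conflict declaration might look like the following sketch in pyproject.toml (the group names normal and vilaja are the ones used above; the actual entries depend on the project's configuration):

[tool.uv]
# Groups listed in the same entry cannot be installed together.
conflicts = [
    [
        { group = "normal" },
        { group = "vilaja" },
    ],
]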

Benchmark-Specific Required Libraries

  • JIC-VQA

For the JIC-VQA dataset, you need to download images from URLs. Use the following script to prepare the dataset:

python scripts/prepare_jic_vqa.py

Analyze VLMs Prediction

Visualize your model’s predictions with the following Streamlit app:

uv run streamlit run scripts/browse_prediction.py -- --task_id "japanese-heron-bench" --result_dir "result"

You will be able to browse the visualized predictions in the Streamlit app.

Contribution

We welcome contributions! If you encounter issues, or if you have suggestions or improvements, please open an issue or submit a pull request.

How to Add a Benchmark Task

Refer to the src/eval_mm/tasks directory to implement new benchmark tasks.
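
As a rough, hypothetical sketch (the actual base class and registration mechanism live in src/eval_mm/tasks and may differ), a new task needs to expose the same accessors used in the library example above:

from datasets import load_dataset  # assumes the dataset is hosted on Hugging Face

class MyNewTask:
    """Hypothetical task; the dataset name and column names are illustrative."""

    def __init__(self):
        self.dataset = load_dataset("my-org/my-vqa-dataset", split="test")

    def doc_to_text(self, doc) -> str:
        return doc["question"]

    def doc_to_visual(self, doc):
        return [doc["image"]]  # list of PIL images

    def doc_to_answer(self, doc) -> str:
        return doc["answer"]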

How to Add a Metric

To add new metrics, implement them in the Scorer class. The code for existing scorers can be found in src/eval_mm/metrics.
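
For orientation, here is a hypothetical sketch of a scorer with the score/aggregate interface used in the library example above; the real Scorer base class in src/eval_mm/metrics may differ (for instance, it returns an AggregateOutput rather than a plain dict):

class ExactMatchScorer:
    """Hypothetical scorer: 1.0 when the prediction equals the reference, else 0.0."""

    def score(self, refs: list[str], preds: list[str]) -> list[float]:
        return [float(r.strip() == p.strip()) for r, p in zip(refs, preds)]

    def aggregate(self, scores: list[float]) -> dict:
        overall = sum(scores) / len(scores) if scores else 0.0
        return {"overall_score": overall, "details": {"exact_match": overall}}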

How to Add Inference Code for a VLM Model

Implement the inference code for VLM models in the VLM class. For reference, check examples/base_vlm.py.
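
As with the MockVLM in the library example above, a model wrapper only needs a generate(images, text) method. A toy, hypothetical sketch:

from PIL import Image

class EchoVLM:
    """Hypothetical model: ignores the images and echoes the prompt."""

    def generate(self, images: list[Image.Image], text: str) -> str:
        return f"(saw {len(images)} image(s)) {text}"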

How to Add Dependencies

To add a new dependency, run:

uv add <package_name>                        # add to the main dependencies
uv add --group <group_name> <package_name>   # add to a specific dependency group

Testing

Run the following commands to test the task classes and metrics (test.sh) and the VLM models (test_model.sh):

bash test.sh
bash test_model.sh

Formatting and Linting with Ruff

uv run ruff format src
uv run ruff check --fix src

Releasing to PyPI

To release a new version to PyPI:

git tag -a v0.x.x -m "version 0.x.x"
git push origin --tags

Updating the Website

For website updates, refer to the github_pages/README.md.

Acknowledgements

  • Heron: We refer to the Heron code for the evaluation of the Japanese Heron Bench task.
  • lmms-eval: We refer to the lmms-eval code for the evaluation of the JMMMU and MMMU tasks.

We also thank the developers of the evaluation datasets for their hard work.
