
llm-jp-eval-mm automates the evaluation of multimodal large language models (vision-language models, VLMs) across various datasets, with a primary focus on Japanese tasks.
This tool supports multi-modal text generation tasks and calculates task-specific evaluation metrics based on the inference results provided by users.
You can install llm-jp-eval-mm either from GitHub or via PyPI.

From GitHub:

```bash
git clone git@github.com:llm-jp/llm-jp-eval-mm.git
cd llm-jp-eval-mm
uv sync
```

From PyPI:

```bash
pip install eval_mm
```
This tool uses the LLM-as-a-judge method for evaluation, which sends requests to GPT-4o via the OpenAI API. You need to configure the API keys in a .env file:

- For Azure OpenAI: AZURE_OPENAI_ENDPOINT and AZURE_OPENAI_KEY
- For the OpenAI API: OPENAI_API_KEY

If you are not using the LLM-as-a-judge method, you can set any placeholder value in the .env file to bypass the error.
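To confirm the keys are actually picked up before launching an evaluation, a quick check such as the following can help. This is only a sketch and not part of llm-jp-eval-mm; it assumes the python-dotenv package is installed.

```python
# Sanity check for the .env configuration (not part of llm-jp-eval-mm).
# Assumes python-dotenv is installed: pip install python-dotenv
import os
from dotenv import load_dotenv

load_dotenv()  # reads the .env file in the current directory

# Either the Azure pair or the plain OpenAI key is needed for LLM-as-a-judge metrics.
azure_ok = bool(os.getenv("AZURE_OPENAI_ENDPOINT")) and bool(os.getenv("AZURE_OPENAI_KEY"))
openai_ok = bool(os.getenv("OPENAI_API_KEY"))
print("Azure OpenAI configured:", azure_ok)
print("OpenAI API configured:", openai_ok)
```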
To evaluate your model on a specific task, we provide an example script: examples/sample.py. For example, to evaluate the llava-hf/llava-1.5-7b-hf model on the japanese-heron-bench task, run:
```bash
uv sync --group normal
uv run --group normal python examples/sample.py \
  --model_id llava-hf/llava-1.5-7b-hf \
  --task_id japanese-heron-bench \
  --result_dir result \
  --metrics "heron-bench" \
  --judge_model "gpt-4o-2024-11-20" \
  --overwrite
```
The evaluation results will be saved in the result directory:
```
result
├── japanese-heron-bench
│   ├── llava-hf
│   │   ├── llava-1.5-7b-hf
│   │   │   ├── evaluation.jsonl
│   │   │   └── prediction.jsonl
```
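Both files are plain JSON Lines, so they are easy to inspect programmatically. The snippet below is a minimal sketch: the path follows the tree above, but the field names inside each record depend on the task, so check the keys rather than assuming a schema.

```python
# Inspect saved predictions. Only the JSON Lines format and the directory
# layout shown above are assumed; field names vary by task.
import json
from pathlib import Path

pred_file = Path("result/japanese-heron-bench/llava-hf/llava-1.5-7b-hf/prediction.jsonl")
with pred_file.open(encoding="utf-8") as f:
    records = [json.loads(line) for line in f]

print(f"{len(records)} predictions")
print(records[0].keys())  # see which fields this task writes
```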
If you want to evaluate multiple models on multiple tasks, please check eval_all.sh.
You can also integrate llm-jp-eval-mm into your own code. Here's an example:
```python
from PIL import Image
from eval_mm import TaskRegistry, ScorerRegistry, ScorerConfig

# A dummy model that always answers "宮崎駿" (Hayao Miyazaki).
class MockVLM:
    def generate(self, images: list[Image.Image], text: str) -> str:
        return "宮崎駿"

# Load a task and pull the first example.
task = TaskRegistry.load_task("japanese-heron-bench")
example = task.dataset[0]
input_text = task.doc_to_text(example)
images = task.doc_to_visual(example)
reference = task.doc_to_answer(example)

# Run the model on the example.
model = MockVLM()
prediction = model.generate(images, input_text)

# Score the prediction against the reference with ROUGE-L.
scorer = ScorerRegistry.load_scorer(
    "rougel",
    ScorerConfig(docs=task.dataset),
)
result = scorer.aggregate(scorer.score([reference], [prediction]))
print(result)
# AggregateOutput(overall_score=5.128205128205128, details={'rougel': 5.128205128205128})
```
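Building on the same calls, the sketch below scores every example in a task instead of a single one. It only assumes the API shown above (doc_to_text, doc_to_visual, doc_to_answer, generate, score, aggregate); the library may offer a more direct batch interface.

```python
# Hypothetical batch evaluation composed from the calls shown above.
from eval_mm import TaskRegistry, ScorerRegistry, ScorerConfig

task = TaskRegistry.load_task("japanese-heron-bench")
scorer = ScorerRegistry.load_scorer("rougel", ScorerConfig(docs=task.dataset))
model = MockVLM()  # from the example above; replace with your own model wrapper

references, predictions = [], []
for example in task.dataset:
    references.append(task.doc_to_answer(example))
    predictions.append(
        model.generate(task.doc_to_visual(example), task.doc_to_text(example))
    )

# score() takes parallel lists of references and predictions;
# aggregate() reduces the per-example scores to an overall result.
result = scorer.aggregate(scorer.score(references, predictions))
print(result)
```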
To generate a leaderboard from your evaluation results, run:
```bash
python scripts/make_leaderboard.py --result_dir result
```
This will create a leaderboard.md file with your model's performance:
| Model | Heron/LLM | JVB-ItW/LLM | JVB-ItW/Rouge |
|---|---|---|---|
| llm-jp/llm-jp-3-vila-14b | 68.03 | 4.08 | 52.4 |
| Qwen/Qwen2.5-VL-7B-Instruct | 70.29 | 4.28 | 29.63 |
| google/gemma-3-27b-it | 69.15 | 4.36 | 30.89 |
| microsoft/Phi-4-multimodal-instruct | 45.52 | 3.2 | 26.8 |
| gpt-4o-2024-11-20 | 93.7 | 4.44 | 32.2 |

(Heron: Japanese Heron Bench, LLM-judge score; JVB-ItW: JA-VLM-Bench-In-the-Wild, LLM-judge and ROUGE-L scores.)
The official leaderboard is available on the project website.
Currently, the following benchmark tasks are supported:
Japanese Tasks:
English Tasks:
Each VLM may have different dependencies. To manage these, llm-jp-eval-mm uses uv's dependency groups.
For example, to use llm-jp/llm-jp-3-vila-14b, run:

```bash
uv sync --group vilaja
uv run --group vilaja python examples/VILA_ja.py
```
Refer to eval_all.sh for a full list of model dependencies.
When you add a new group, don't forget to configure the corresponding conflict settings (uv's conflicts table in pyproject.toml) so that incompatible groups are not resolved together.
For the JIC-VQA dataset, you need to download images from URLs. Use the following script to prepare the dataset:
```bash
python scripts/prepare_jic_vqa.py
```
Visualize your model’s predictions with the following Streamlit app:
```bash
uv run streamlit run scripts/browse_prediction.py --task_id "japanese-heron-bench" --result_dir "result"
```
You will then be able to browse the visualized predictions in your browser.
We welcome contributions! If you encounter issues, or if you have suggestions or improvements, please open an issue or submit a pull request.
Refer to the src/eval_mm/tasks directory to implement new benchmark tasks.
To add new metrics, implement them in the Scorer class. The code for existing scorers can be found in src/eval_mm/metrics.
Implement the inference code for VLMs in the VLM class. For reference, check examples/base_vlm.py.
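As an illustration, here is a minimal wrapper with the same generate(images, text) signature as the MockVLM shown earlier, using the llava-hf/llava-1.5-7b-hf checkpoint mentioned above via Hugging Face Transformers. This is only a sketch: the actual base class and helper methods expected in examples/base_vlm.py may differ, and the prompt format is model-specific.

```python
# Sketch of a custom VLM wrapper (the real base class in examples/base_vlm.py may differ).
from PIL import Image
import torch
from transformers import AutoProcessor, LlavaForConditionalGeneration

class LlavaVLM:
    def __init__(self, model_id: str = "llava-hf/llava-1.5-7b-hf"):
        self.processor = AutoProcessor.from_pretrained(model_id)
        self.model = LlavaForConditionalGeneration.from_pretrained(
            model_id, torch_dtype=torch.float16, device_map="auto"
        )

    def generate(self, images: list[Image.Image], text: str) -> str:
        # llava-1.5 expects one <image> placeholder per image in the prompt.
        prompt = "USER: " + "<image>\n" * len(images) + f"{text}\nASSISTANT:"
        inputs = self.processor(images=images, text=prompt, return_tensors="pt")
        inputs = inputs.to(self.model.device, torch.float16)
        output = self.model.generate(**inputs, max_new_tokens=256)
        decoded = self.processor.decode(output[0], skip_special_tokens=True)
        # Return only the assistant's answer, not the echoed prompt.
        return decoded.split("ASSISTANT:")[-1].strip()
```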
To add a new dependency, run one of the following (use --group to add it to a specific dependency group):

```bash
uv add <package_name>
uv add --group <group_name> <package_name>
```
Run the following commands to test the task classes and metrics, and to test the VLM models:

```bash
bash test.sh
bash test_model.sh
```

To format and lint the code, run:

```bash
uv run ruff format src
uv run ruff check --fix src
```
To release a new version to PyPI:

```bash
git tag -a v0.x.x -m "version 0.x.x"
git push origin --tags
```
For website updates, refer to the github_pages/README.md.
We also thank the developers of the evaluation datasets for their hard work.