[ Japanese | English ]
This tool automatically evaluates Japanese multi-modal large language models across multiple datasets.
For details on the data format and the list of supported datasets, please check DATASET.md.
You can also use this tool via PyPI.
Use the `pip` command to install `eval_mm` into the virtual environment where you want to run it:
pip install eval_mm
This tool uses the LLM-as-a-judge method for evaluation, which sends requests to GPT-4o via the OpenAI API. Please create a `.env` file and set `AZURE_OPENAI_ENDPOINT` and `AZURE_OPENAI_KEY` if you're using Azure, or `OPENAI_API_KEY` if you're using the OpenAI API.
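For example, a `.env` for the Azure setup might look like the following (the endpoint and key values below are placeholders, not real credentials):

```
AZURE_OPENAI_ENDPOINT=https://<your-resource-name>.openai.azure.com/
AZURE_OPENAI_KEY=<your-azure-openai-key>

# If you call the OpenAI API directly, set this instead of the two Azure variables:
# OPENAI_API_KEY=sk-...
```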
That’s it for environment setup.
If you prefer to clone the repository and use it, please follow the instructions below.
`eval-mm` uses `uv` to manage virtual environments.
Clone the repository and move into it:
git clone git@github.com:llm-jp/llm-jp-eval-mm.git
cd llm-jp-eval-mm
Build the environment with `uv`. Please install `uv` by referring to the official documentation.
cd llm-jp-eval-mm
uv sync
Following the sample `.env.sample`, create a `.env` file and set `AZURE_OPENAI_ENDPOINT` and `AZURE_OPENAI_KEY`, or `OPENAI_API_KEY`.
That’s all you need for the setup.
(Currently, the llm-jp-eval-mm repository is private. You can download the `examples` directory from the Source Distribution at https://pypi.org/project/eval-mm/#files.)
We provide sample code in `examples/sample.py` for running an evaluation.
Each `examples/{model_name}.py` file provides only the inference method for that model.
If you want to run an evaluation with a new inference method or a new model, create a similar file by referencing an existing `examples/{model_name}.py`, and you can run the evaluation in the same way.
For example, if you want to evaluate the `llava-hf/llava-1.5-7b-hf` model on the japanese-heron-bench task, run the following commands:
uv sync --group normal
uv run --group normal python examples/sample.py \
--model_id llava-hf/llava-1.5-7b-hf \
--task_id japanese-heron-bench \
--result_dir test \
--metrics "llm_as_a_judge_heron_bench" \
--judge_model "gpt-4o-2024-05-13" \
--overwrite
The evaluation scores and output results will be saved in `test/{task_id}/evaluation/{model_id}.jsonl` and `test/{task_id}/prediction/{model_id}.jsonl`.
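Both files are in JSON Lines format, so you can inspect them with a few lines of Python. This is only a sketch: the exact fields in each record depend on the task and metric, and the on-disk file name may differ if the model ID contains a slash.

```python
import json

# Path template from this README; the task and model IDs below are the ones
# used in the example command above.
path = "test/{task_id}/prediction/{model_id}.jsonl".format(
    task_id="japanese-heron-bench",
    model_id="llava-hf/llava-1.5-7b-hf",
)

with open(path, encoding="utf-8") as f:
    records = [json.loads(line) for line in f]

print(f"{len(records)} prediction records")
print(records[0])  # inspect which fields this task produces
```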
If you want to evaluate multiple models on multiple tasks, please check `eval_all.sh`.
We plan to publish a leaderboard summarizing the evaluation results of major models soon.
Right now, the following benchmark tasks are supported:
Japanese Task:
English Task:
Different models require different libraries. In this repository, we use uv's dependency groups to manage the libraries needed for each model.
When using the following models, please specify the `normal` group:

stabilityai/japanese-instructblip-alpha, stabilityai/japanese-stable-vlm, cyberagent/llava-calm2-siglip, llava-hf/llava-1.5-7b-hf, llava-hf/llava-v1.6-mistral-7b-hf, neulab/Pangea-7B-hf, meta-llama/Llama-3.2-11B-Vision-Instruct, meta-llama/Llama-3.2-90B-Vision-Instruct, OpenGVLab/InternVL2-8B, Qwen/Qwen2-VL-7B-Instruct, OpenGVLab/InternVL2-26B, Qwen/Qwen2-VL-72B-Instruct, gpt-4o-2024-05-13
uv sync --group normal
When using the following model, please specify the `evovlm` group:

SakanaAI/Llama-3-EvoVLM-JP-v2
uv sync --group evovlm
When using the following models, please specify the `vilaja` group:

llm-jp/llm-jp-3-vila-14b, Efficient-Large-Model/VILA1.5-13b
uv sync --group vilaja
When using the following model, please specify the `pixtral` group:

mistralai/Pixtral-12B-2409
uv sync --group pixtral
When running the script, make sure to specify the group:
$ uv run --group normal python ...
If you add a new group, don't forget to configure `conflict` in `pyproject.toml` as well.
JDocQA: For constructing the JDocQA dataset, you need the `pdf2image` library. Since `pdf2image` depends on poppler-utils, please install it with:
sudo apt-get install poppler-utils
This repository is licensed under the Apache-2.0 License. For the licenses of each evaluation dataset, please see DATASET.md.
Tasks are defined in the `Task` class.
Please reference the code in `src/eval_mm/tasks` and implement your `Task` class. You'll need methods to convert the dataset into a format for input to the VLM model, and methods to calculate the score.
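As a rough illustration only (the class and method names below, other than `Task`, are hypothetical; follow the actual base class in `src/eval_mm/tasks` for the real interface), a new task essentially wraps a dataset, converts each example into VLM input, and scores the generated answer:

```python
# Hypothetical sketch of a task; mirror the real Task base class in src/eval_mm/tasks.
from datasets import load_dataset


class MyNewTask:
    """Illustrative task wrapping a (fictional) Japanese VQA dataset."""

    def __init__(self):
        # Placeholder dataset name, for illustration only.
        self.dataset = load_dataset("my-org/my-japanese-vqa", split="test")

    def doc_to_text(self, doc) -> str:
        # Convert one dataset example into the text prompt passed to the VLM.
        return doc["question"]

    def doc_to_visual(self, doc) -> list:
        # Convert one dataset example into the image(s) passed to the VLM.
        return [doc["image"]]

    def calc_score(self, reference: str, generated: str) -> float:
        # Score one generated answer against the reference (exact match here).
        return float(reference.strip() == generated.strip())
```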
Metrics are defined in the `Scorer` class.
Please reference the code in `src/eval_mm/metrics` and implement your `Scorer` class. You'll need to implement a `score()` method for sample-level scoring that compares references with generated outputs, and an `aggregate()` method for population-level metric calculation.
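For illustration, a trivial exact-match metric could look like the sketch below. Only the `score()` and `aggregate()` method names come from this README; the class name and argument lists are assumptions, so match the actual signatures in `src/eval_mm/metrics`.

```python
# Hypothetical sketch of a metric; mirror the real Scorer base class in src/eval_mm/metrics.
class ExactMatchScorer:
    """Illustrative metric: exact string match between reference and output."""

    @staticmethod
    def score(refs: list[str], preds: list[str]) -> list[int]:
        # Sample-level scores comparing each reference with the generated output.
        return [int(ref.strip() == pred.strip()) for ref, pred in zip(refs, preds)]

    @staticmethod
    def aggregate(scores: list[int]) -> float:
        # Population-level metric: the mean of the sample-level scores.
        return sum(scores) / len(scores) if scores else 0.0
```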
Inference code for VLM models is defined in the `VLM` class.
Please reference `examples/base_vlm` and implement your `VLM` class. You'll need a `generate()` method that produces output text from images and prompts.
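As a sketch of the idea (assuming a Hugging Face LLaVA checkpoint; the class name, constructor, and prompt format below are illustrative, so mirror `examples/base_vlm` for the exact interface that `examples/sample.py` expects):

```python
# Hypothetical sketch of a VLM wrapper; mirror examples/base_vlm for the real interface.
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration


class MyVLM:
    """Illustrative wrapper around llava-hf/llava-1.5-7b-hf."""

    def __init__(self, model_id: str = "llava-hf/llava-1.5-7b-hf"):
        self.processor = AutoProcessor.from_pretrained(model_id)
        self.model = LlavaForConditionalGeneration.from_pretrained(model_id)

    def generate(self, images: list[Image.Image], prompt: str) -> str:
        # Produce output text from the given images and text prompt.
        # LLaVA-1.5 expects an <image> placeholder in the prompt.
        text = f"USER: <image>\n{prompt} ASSISTANT:"
        inputs = self.processor(images=images, text=text, return_tensors="pt")
        output_ids = self.model.generate(**inputs, max_new_tokens=256)
        # Decode only the newly generated tokens, not the prompt.
        new_tokens = output_ids[:, inputs["input_ids"].shape[1]:]
        return self.processor.batch_decode(new_tokens, skip_special_tokens=True)[0]
```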
To add a dependency to the project, run:

uv add <package_name>

To add a dependency to a specific dependency group, run:

uv add --group <group_name> <package_name>
To format and lint the code, run:

uv run ruff format src
uv run ruff check --fix src
To publish a new release, tag the version and push the tag:

git tag -a v0.x.x -m "version 0.x.x"
git push origin --tags
Or you can manually create a new release on GitHub.
For the project website, please refer to github_pages/README.md.
We also thank the developers of the evaluation datasets for their hard work.