
pearmut
Platform for Evaluation and Reviewing of Multilingual Tasks: Evaluate model outputs for translation and NLP tasks with support for multimodal data (text, video, audio, images) and multiple annotation protocols (DA, ESA, ESAAI, MQM, and more!).
Install and run locally without cloning:
pip install pearmut
# Download example campaigns
wget https://raw.githubusercontent.com/zouharvi/pearmut/refs/heads/main/examples/esa.json
wget https://raw.githubusercontent.com/zouharvi/pearmut/refs/heads/main/examples/da.json
# Load and start
pearmut add esa.json da.json
pearmut run
Campaigns are defined in JSON files (see examples/). The simplest configuration uses task-based assignment where each user has pre-defined tasks:
{
"info": {
"assignment": "task-based",
# DA: scores
# ESA: error spans and scores
# MQM: error spans, categories, and scores
"protocol": "ESA",
},
"campaign_id": "wmt25_#_en-cs_CZ",
"data": [
# data for first task/user
[
[
# each evaluation item is a document
{
"instructions": "Evaluate translation from en to cs_CZ", # message to show to users above the first item
"src": "This will be the year that Guinness loses its cool. Cheers to that!",
"tgt": {"modelA": "NevĂm pĆesnÄ, kdy jsem to poprvĂ© zaznamenal. MoĆŸnĂĄ to bylo ve chvĂli, ..."}
},
{
"src": "I'm not sure I can remember exactly when I sensed it. Maybe it was when some...",
"tgt": {"modelA": "Tohle bude rok, kdy Guinness pĆijde o svĆŻj âcoolâ faktor. Na zdravĂ!"}
}
...
],
# more documents
...
],
# data for second task/user
[
...
],
# arbitrary number of users (each corresponds to a single URL to be shared)
]
}
Task items are protocol-specific. For ESA/DA/MQM protocols, each item is a dictionary representing a document unit:
[
{
"src": "A najednou se vĆĄechna tato voda naplnila dalĆĄĂmi lidmi a dalĆĄĂmi vÄcmi.", # required
"tgt": {"modelA": "And suddenly all the water became full of other people and other people."} # required (dict)
},
{
"src": "toto je pokraÄovĂĄnĂ stejnĂ©ho dokumentu",
"tgt": {"modelA": "this is a continuation of the same document"}
# Additional keys stored for analysis
}
]
Load campaigns and start the server:
pearmut add my_campaign.json # Use -o/--overwrite to replace existing
pearmut run
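For larger campaigns it can be convenient to generate the nested structure (campaign → per-user tasks → documents → items) with a short script rather than by hand. The sketch below is illustrative and not part of Pearmut; it only uses the JSON keys shown in the examples in this README, and the file name and data are placeholders.

# build_campaign.py -- illustrative sketch (not shipped with Pearmut).
# Writes a task-based ESA campaign using the keys documented above.
import json

# One item is a source/target pair; one document is a list of items;
# one task (= one user) is a list of documents.
documents = [
    [
        {"src": "This is the first source sentence.",
         "tgt": {"modelA": "Toto je první zdrojová věta."}},
        {"src": "And this continues the same document.",
         "tgt": {"modelA": "A toto pokračuje ve stejném dokumentu."}},
    ],
]

campaign = {
    "campaign_id": "my_campaign",
    "info": {"assignment": "task-based", "protocol": "ESA"},
    "data": [documents],  # one entry per user; here a single user gets all documents
}

with open("my_campaign.json", "w", encoding="utf-8") as f:
    json.dump(campaign, f, ensure_ascii=False, indent=2)

The resulting file can then be loaded with pearmut add my_campaign.json.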
Assignment types:
- task-based: Each user has predefined items
- single-stream: All users draw from a shared pool (random assignment)
- dynamic: work in progress ⚠️

By default, Pearmut randomly shuffles the order in which models are shown for each item in order to avoid positional bias.
The shuffle parameter in campaign info controls this behavior:
{
"info": {
"assignment": "task-based",
"protocol": "ESA",
"shuffle": true # Default: true. Set to false to disable shuffling.
},
"campaign_id": "my_campaign",
"data": [...]
}
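Shuffling is applied per item, so a given model does not always appear in the same screen position. The snippet below only illustrates the idea of per-item shuffling; it is not Pearmut's internal code.

# Illustration of per-item shuffling of the candidate display order.
import random

item = {"src": "Hello.", "tgt": {"modelA": "Ahoj.", "modelB": "Dobrý den."}}

display_order = list(item["tgt"])
random.shuffle(display_order)  # a fresh random order for every item avoids positional bias
for model in display_order:
    print(model, "->", item["tgt"][model])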
Include error_spans to pre-fill annotations that users can review, modify, or delete:
{
"src": "The quick brown fox jumps over the lazy dog.",
"tgt": {"modelA": "RychlĂĄ hnÄdĂĄ liĆĄka skĂĄÄe pĆes lĂnĂ©ho psa."},
"error_spans": {
"modelA": [
{
"start_i": 0, # character index start (inclusive)
"end_i": 5, # character index end (inclusive)
"severity": "minor", # "minor", "major", "neutral", or null
"category": null # MQM category string or null
},
{
"start_i": 27,
"end_i": 32,
"severity": "major",
"category": null
}
]
}
}
The error_spans field contains one list of spans per candidate. See examples/esaai_prefilled.json.
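When generating pre-filled spans automatically, note that start_i and end_i are both inclusive character indices into the candidate string. A minimal sketch under that assumption (the helper function is hypothetical; only the span keys come from the example above):

# Hypothetical helper: turn a substring match into a pre-filled error span.
def span_for(tgt_text: str, phrase: str, severity: str = "minor") -> dict:
    start = tgt_text.index(phrase)  # raises ValueError if the phrase is absent
    return {
        "start_i": start,                  # inclusive character index
        "end_i": start + len(phrase) - 1,  # inclusive character index
        "severity": severity,              # "minor", "major", "neutral", or None
        "category": None,                  # MQM category string or None
    }

tgt = "Rychlá hnědá liška skáče přes líného psa."
error_spans = {"modelA": [span_for(tgt, "Rychlá")]}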
Add validation rules for tutorials or attention checks:
{
"src": "The quick brown fox jumps.",
"tgt": {"modelA": "RychlĂĄ hnÄdĂĄ liĆĄka skĂĄÄe."},
"validation": {
"modelA": [
{
"warning": "Please set score between 70-80.", # shown on failure (omit for silent logging)
"score": [70, 80], # required score range [min, max]
"error_spans": [{"start_i": [0, 2], "end_i": [4, 8], "severity": "minor"}], # expected spans
"allow_skip": true # show "skip tutorial" button
}
]
}
}
Validation types:
- allow_skip: true together with warning: let users skip after seeing the feedback
- warning without allow_skip: force a retry
- omit warning: log failures without notifying the user (quality control)

The validation field holds a list of rules per candidate. The dashboard shows ✅/❌ based on validation_threshold in info (an integer sets the maximum number of failed checks, a float in [0,1) the maximum failed proportion; default 0).
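The threshold logic can be reproduced offline when post-processing annotations. A minimal sketch of the semantics described above (the function is hypothetical, not Pearmut's API):

# Hypothetical re-implementation of the validation_threshold semantics:
# an integer is the maximum number of failed checks, a float in [0, 1)
# is the maximum allowed proportion of failed checks.
def passes_quality_control(failed: int, total: int, threshold=0) -> bool:
    if isinstance(threshold, float) and 0 <= threshold < 1:
        return failed / total <= threshold
    return failed <= threshold

print(passes_quality_control(failed=0, total=10))                   # True (default threshold 0)
print(passes_quality_control(failed=2, total=10, threshold=0.25))   # True  (20% <= 25%)
print(passes_quality_control(failed=3, total=10, threshold=2))      # False (3 > 2)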
Score comparison: Use score_greaterthan to ensure one candidate scores higher than another:
{
"src": "AI transforms industries.",
"tgt": {"A": "UI transformuje prĆŻmysly.", "B": "UmÄlĂĄ inteligence mÄnĂ obory."},
"validation": {
"A": [
{"warning": "A has error, score 20-40.", "score": [20, 40]}
],
"B": [
{"warning": "B is correct and must score higher than A.", "score": [70, 90], "score_greaterthan": "A"}
]
}
}
The score_greaterthan field names the candidate that must have a lower score than the current candidate.
See examples/tutorial_kway.json.
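When preparing tutorial or attention-check items, it helps to sanity-check the rules against the intended answers. The following sketch checks score ranges and score_greaterthan constraints; the function and the default 0-100 score range are assumptions for illustration only.

# Hypothetical checker for score-range and score_greaterthan rules.
def check_scores(validation: dict, scores: dict) -> list:
    failures = []
    for model, rules in validation.items():
        for rule in rules:
            lo, hi = rule.get("score", (0, 100))
            if not (lo <= scores[model] <= hi):
                failures.append(rule.get("warning", f"{model}: score out of range"))
            other = rule.get("score_greaterthan")
            if other is not None and scores[model] <= scores[other]:
                failures.append(rule.get("warning", f"{model}: must score above {other}"))
    return failures

validation = {
    "A": [{"warning": "A has error, score 20-40.", "score": [20, 40]}],
    "B": [{"warning": "B is correct and must score higher than A.",
           "score": [70, 90], "score_greaterthan": "A"}],
}
print(check_scores(validation, {"A": 30, "B": 85}))  # [] -> would pass
print(check_scores(validation, {"A": 90, "B": 80}))  # both warnings -> would fail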
With single-stream assignment, all annotators draw from a shared pool with random assignment:
{
"campaign_id": "my campaign 6",
"info": {
"assignment": "single-stream",
# DA: scores
# MQM: error spans and categories
# ESA: error spans and scores
"protocol": "ESA",
"users": 50, # number of annotators (can also be a list, see below)
},
"data": [...], # list of all items (shared among all annotators)
}
The users field accepts:
- an integer (e.g. 50): Generate random user IDs
- a list of strings (e.g. ["alice", "bob"]): Use specific user IDs
- a list of dictionaries with explicit completion tokens:
{
"info": {
...
"users": [
{"user_id": "alice", "token_pass": "alice_done", "token_fail": "alice_fail"},
{"user_id": "bob", "token_pass": "bob_done"} # missing tokens are auto-generated
],
},
...
}
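For many annotators, the explicit list with tokens can be generated up front. A minimal sketch (the token format is an arbitrary choice for illustration; only user_id, token_pass, and token_fail come from the example above):

# Illustrative generation of the explicit users list with completion tokens.
import secrets

def make_users(user_ids):
    return [
        {
            "user_id": uid,
            "token_pass": secrets.token_hex(4),  # shown when the user passes quality control
            "token_fail": secrets.token_hex(4),  # shown when the user fails quality control
        }
        for uid in user_ids
    ]

users = make_users(["alice", "bob"])  # goes into info["users"]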
Pearmut supports HTML-compatible elements in item fields (YouTube embeds, <video> tags, images). Ensure elements are pre-styled. See examples/multimodal.json.
Host local assets (audio, images, videos) using the assets key:
{
"campaign_id": "my_campaign",
"info": {
"assets": {
"source": "videos", # Source directory
"destination": "assets/my_videos" # Mount path (must start with "assets/")
}
},
"data": [ ... ]
}
Files from videos/ become accessible at localhost:8001/assets/my_videos/. Pearmut creates a symlink, so the source directory must exist throughout annotation. Destination paths must be unique across campaigns.
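With the mount above, multimodal items can reference the hosted files directly in their HTML. A minimal sketch building <video> items that point at the mounted path (the file names are placeholders; the relative URL assumes the annotation page is served by the same Pearmut server):

# Illustrative: build items whose source field is a hosted <video> element.
video_files = ["clip1.mp4", "clip2.mp4"]  # expected to exist in the local videos/ directory

items = [
    {
        "src": f'<video controls width="480" src="assets/my_videos/{name}"></video>',
        "tgt": {"modelA": "Candidate caption or translation for this clip."},
    }
    for name in video_files
]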
- pearmut add <file(s)>: Add campaign JSON files (supports wildcards)
  - -o/--overwrite: Replace existing campaigns with the same ID
  - --server <url>: Server URL prefix (default: http://localhost:8001)
- pearmut run: Start the server
  - --port <port>: Server port (default: 8001)
  - --server <url>: Server URL prefix
- pearmut purge [campaign]: Remove campaign data
The management link (shown when adding campaigns or when running the server) provides access to the campaign dashboard.
Completion tokens are shown at the end of annotation for verification (the correct tokens can be downloaded from the dashboard). An incorrect token is shown if quality control fails. When tokens are supplied, the dashboard also tries to show model rankings based on the model names in the tgt dictionaries.
"wmt25_#_en-cs_CZ"). Used to reference and manage specific campaigns. Typically a campaign is created for a specific language and domain.token_pass): Shown when user meets validation thresholdstoken_fail): Shown when user fails to meet validation requirementsallow_skip: true to let users skip if they have seen it before."GPT-4", "Claude"). Used for tracking and ranking model performance.minor, major)basic template supports comparing multiple outputs simultaneously.Server responds to data-only requests from frontend (no template coupling). Frontend served from pre-built static/ on install.
cd pearmut
# Frontend (separate terminal, recompiles on change)
npm install web/ --prefix web/
npm run build --prefix web/
# optionally keep running indefinitely to auto-rebuild
npm run watch --prefix web/
# Install as editable
pip3 install -e .
# Load examples
pearmut add examples/wmt25_#_en-cs_CZ.json examples/wmt25_#_cs-de_DE.json
pearmut run
To add a new annotation template, put the frontend code in web/src, register it in webpack.config.js, and set info->template in the campaign JSON. See web/src/basic.ts for an example.
To collect annotations from remote users, run Pearmut on a public server, or run it locally and tunnel the local port to a public IP/domain.
If you use this work in your paper, please cite it as follows.
@misc{zouhar2025pearmut,
author={Vilém Zouhar},
title={Pearmut: Platform for Evaluating and Reviewing of Multilingual Tasks},
url={https://github.com/zouharvi/pearmut/},
year={2026},
}
Contributions are welcome! Please reach out to Vilém Zouhar.