Pearmut 🍐

Platform for Evaluation and Reviewing of Multilingual Tasks: Evaluate model outputs for translation and NLP tasks with support for multimodal data (text, video, audio, images) and multiple annotation protocols (DA, ESA, ESAAI, MQM, and more!).


Screenshot of ESA/MQM interface

Table of Contents

  • Quick Start
  • Campaign Configuration
  • Advanced Features
  • CLI Commands
  • Campaign Management
  • Terminology
  • Development
  • Deployment
  • Misc.

Quick Start

Install and run locally without cloning:

pip install pearmut
# Download example campaigns
wget https://raw.githubusercontent.com/zouharvi/pearmut/refs/heads/main/examples/esa.json
wget https://raw.githubusercontent.com/zouharvi/pearmut/refs/heads/main/examples/da.json
# Load and start
pearmut add esa.json da.json
pearmut run

Campaign Configuration

Basic Structure

Campaigns are defined in JSON files (see examples/). The simplest configuration uses task-based assignment where each user has pre-defined tasks:

{
  "info": {
    "assignment": "task-based",
    # DA: scores
    # ESA: error spans and scores
    # MQM: error spans, categories, and scores
    "protocol": "ESA", 
  },
  "campaign_id": "wmt25_#_en-cs_CZ",
  "data": [
    # data for first task/user
    [
      [
        # each evaluation item is a document
        {
          "instructions": "Evaluate translation from en to cs_CZ",  # message to show to users above the first item
          "src": "This will be the year that Guinness loses its cool. Cheers to that!",
          "tgt": {"modelA": "NevĂ­m pƙesně, kdy jsem to poprvĂ© zaznamenal. MoĆŸnĂĄ to bylo ve chvĂ­li, ..."}
        },
        {
          "src": "I'm not sure I can remember exactly when I sensed it. Maybe it was when some...",
          "tgt": {"modelA": "Tohle bude rok, kdy Guinness pƙijde o svƯj „cool“ faktor. Na zdraví!"}
        }
        ...
      ],
      # more documents
      ...
    ],
    # data for second task/user
    [
        ...
    ],
    # arbitrary number of users (each corresponds to a single URL to be shared)
  ]
}

Task items are protocol-specific. For the ESA/DA/MQM protocols, each item is a document: a list of segment dictionaries that are evaluated together:

[
  {
    "src": "A najednou se vơechna tato voda naplnila dalơími lidmi a dalơími věcmi.",  # required
    "tgt": {"modelA": "And suddenly all the water became full of other people and other people."}  # required (dict)
  },
  {
    "src": "toto je pokračovĂĄnĂ­ stejnĂ©ho dokumentu",
    "tgt": {"modelA": "this is a continuation of the same document"}
    # Additional keys stored for analysis
  }
]
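
Beyond src and tgt, any additional keys on a segment are stored for later analysis. A minimal sketch (the extra keys doc_id and domain are arbitrary illustrations, not required fields):

[
  {
    "src": "A najednou se všechna tato voda naplnila dalšími lidmi a dalšími věcmi.",
    "tgt": {"modelA": "And suddenly all the water became full of other people and other people."},
    "doc_id": "news_042",   # hypothetical extra key, kept for analysis
    "domain": "news"        # hypothetical extra key, kept for analysis
  }
]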

Load campaigns and start the server:

pearmut add my_campaign.json  # Use -o/--overwrite to replace existing
pearmut run

Assignment Types

  • task-based: Each user has predefined items
  • single-stream: All users draw from a shared pool (random assignment)
  • dynamic: work in progress ⚠

Advanced Features

Shuffling Model Translations

By default, Pearmut randomly shuffles the order in which models are shown for each item to avoid positional bias. The shuffle parameter in the campaign info controls this behavior:

{
  "info": {
    "assignment": "task-based",
    "protocol": "ESA",
    "shuffle": true  # Default: true. Set to false to disable shuffling.
  },
  "campaign_id": "my_campaign",
  "data": [...]
}

Pre-filled Error Spans (ESAAI)

Include error_spans to pre-fill annotations that users can review, modify, or delete:

{
  "src": "The quick brown fox jumps over the lazy dog.",
  "tgt": {"modelA": "RychlĂĄ hnědĂĄ liĆĄka skáče pƙes lĂ­nĂ©ho psa."},
  "error_spans": {
    "modelA": [
      {
        "start_i": 0,         # character index start (inclusive)
        "end_i": 5,           # character index end (inclusive)
        "severity": "minor",  # "minor", "major", "neutral", or null
        "category": null      # MQM category string or null
      },
      {
        "start_i": 27,
        "end_i": 32,
        "severity": "major",
        "category": null
      }
    ]
  }
}

The error_spans field maps each candidate to a list of spans; start_i and end_i are inclusive character indices into the candidate translation (in the example above, the first span covers the word "Rychlá"). See examples/esaai_prefilled.json.

Tutorial and Attention Checks

Add validation rules for tutorials or attention checks:

{
  "src": "The quick brown fox jumps.",
  "tgt": {"modelA": "Rychlá hnědá liơka skáče."},
  "validation": {
    "modelA": [
      {
        "warning": "Please set score between 70-80.",  # shown on failure (omit for silent logging)
        "score": [70, 80],                             # required score range [min, max]
        "error_spans": [{"start_i": [0, 2], "end_i": [4, 8], "severity": "minor"}],  # expected spans
        "allow_skip": true                             # show "skip tutorial" button
      }
    ]
  }
}

Types:

  • Tutorial: Include allow_skip: true and warning to let users skip after feedback
  • Loud attention checks: Include warning without allow_skip to force retry
  • Silent attention checks: Omit warning to log failures without notification (quality control)

The validation field maps each candidate to a list of checks. The dashboard shows ✅/❌ based on validation_threshold in info (an integer sets the maximum number of failed checks, a float in [0,1) sets the maximum failed proportion; default 0).
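
For illustration, a silent attention check could be configured as in the following sketch (the sentences, score range, and threshold are made-up values); because warning is omitted, failures are only logged and the user is not notified:

{
  "info": {
    "assignment": "task-based",
    "protocol": "ESA",
    "validation_threshold": 1   # at most one failed check still counts as ✅ on the dashboard
  },
  "campaign_id": "my_campaign_with_checks",
  "data": [
    [
      [
        {
          "src": "The meeting was moved to Monday.",
          "tgt": {"modelA": "Schůzka byla přesunuta na pondělí."},
          "validation": {
            "modelA": [
              {"score": [80, 100]}   # no "warning" and no "allow_skip": a silent check
            ]
          }
        }
      ]
    ]
  ]
}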

Score comparison: Use score_greaterthan to ensure one candidate scores higher than another:

{
  "src": "AI transforms industries.",
  "tgt": {"A": "UI transformuje prƯmysly.", "B": "Umělá inteligence mění obory."},
  "validation": {
    "A": [
      {"warning": "A has error, score 20-40.", "score": [20, 40]}
    ],
    "B": [
      {"warning": "B is correct and must score higher than A.", "score": [70, 90], "score_greaterthan": "A"}
    ]
  }
}

The score_greaterthan field identifies the candidate that must receive a lower score than the current candidate. See examples/tutorial_kway.json.

Single-stream Assignment

All annotators draw from a shared pool with random assignment:

{
    "campaign_id": "my campaign 6",
    "info": {
        "assignment": "single-stream",
        # DA: scores
        # MQM: error spans and categories
        # ESA: error spans and scores
        "protocol": "ESA",
        "users": 50,                           # number of annotators (can also be a list, see below)
    },
    "data": [...], # list of all items (shared among all annotators)
}

Pre-defined User IDs and Tokens

The users field accepts:

  • Number (e.g., 50): Generate random user IDs
  • List of strings (e.g., ["alice", "bob"]): Use specific user IDs
  • List of dictionaries: Specify custom tokens:
{
    "info": {
        ...
        "users": [
            {"user_id": "alice", "token_pass": "alice_done", "token_fail": "alice_fail"},
            {"user_id": "bob", "token_pass": "bob_done"}  # missing tokens are auto-generated
        ],
    },
    ...
}

Multimodal Annotations

Pearmut supports HTML-compatible elements in items (YouTube embeds, <video> tags, images). Make sure the elements are pre-styled. See examples/multimodal.json.

Preview of multimodal elements in Pearmut
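
A minimal sketch of such an item, assuming the HTML snippet is placed directly in the src string (see examples/multimodal.json for the authoritative format; the image URL and styling are placeholders):

{
  "src": "<img src='https://example.org/figure.png' style='max-width: 100%'>",
  "tgt": {"modelA": "A bar chart comparing the two systems."}
}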

Hosting Assets

Host local assets (audio, images, videos) using the assets key:

{
    "campaign_id": "my_campaign",
    "info": { 
      "assets": {
        "source": "videos",                    # Source directory
        "destination": "assets/my_videos"      # Mount path (must start with "assets/")
      }
    },
    "data": [ ... ]
}

Files from videos/ become accessible at localhost:8001/assets/my_videos/. Pearmut creates a symlink, so the source directory must exist for the whole duration of the annotation. Destination paths must be unique across campaigns.
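
Assuming the configuration above, an item can then reference a hosted file by its mount path (the file name and the use of a <video> tag are illustrative):

{
  "src": "<video src='/assets/my_videos/clip1.mp4' controls style='max-width: 100%'></video>",
  "tgt": {"modelA": "A short clip of the described scene."}
}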

CLI Commands

  • pearmut add <file(s)>: Add campaign JSON files (supports wildcards)
    • -o/--overwrite: Replace existing campaigns with same ID
    • --server <url>: Server URL prefix (default: http://localhost:8001)
  • pearmut run: Start server
    • --port <port>: Server port (default: 8001)
    • --server <url>: Server URL prefix
  • pearmut purge [campaign]: Remove campaign data
    • Without args: Purge all campaigns
    • With campaign name: Purge specific campaign only
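
Putting these together, a typical workflow could look like this (the file paths, port, campaign name, and server URL are placeholders):

pearmut add campaigns/*.json --overwrite --server http://annotate.example.org:8080
pearmut run --port 8080 --server http://annotate.example.org:8080
# later, remove a single campaign's data
pearmut purge my_campaign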

Campaign Management

The management link (shown when adding campaigns or when running the server) provides:

  • Annotator progress overview
  • Access to annotation links
  • Task progress reset (data preserved)
  • Download progress and annotations
Management dashboard

Completion tokens are shown to users at the end of annotation for verification (the correct tokens can be downloaded from the dashboard). A fail token is shown instead if the user does not pass quality control.

Token on completion

The dashboard will also try to show model rankings based on the model names used as keys in the tgt dictionaries.

Terminology

  • Campaign: An annotation project that contains configuration, data, and user assignments. Each campaign has a unique identifier and is defined in a JSON file.
    • Campaign File: A JSON file that defines the campaign configuration, including the campaign ID, assignment type, protocol settings, and annotation data.
    • Campaign ID: A unique identifier for a campaign (e.g., "wmt25_#_en-cs_CZ"). Used to reference and manage specific campaigns. Typically a campaign is created for a specific language and domain.
  • Task: A unit of work assigned to a user. In task-based assignment, each task consists of a predefined set of items for a specific user.
  • Item: A single annotation unit within a task. For translation evaluation, an item typically represents a document (source text and target translation). Items can contain text, images, audio, or video.
  • Document: A collection of one or more segments (sentence pairs or text units) that are evaluated together as a single item.
  • User / Annotator: A person who performs annotations in a campaign. Each user is identified by a unique user ID and accesses the campaign through a unique URL.
  • Attention Check: A validation item with known correct answers used to ensure annotator quality. Can be:
    • Loud: Shows a warning message and forces a retry on failure
    • Silent: Logs failures without notifying the user (for quality control analysis)
  • Token: A completion code shown to users when they finish their annotations. Tokens verify completion and whether the user passed quality control checks:
    • Pass Token (token_pass): Shown when the user meets validation thresholds
    • Fail Token (token_fail): Shown when the user fails to meet validation requirements
  • Tutorial: An instructional validation item that teaches users how to annotate. Includes allow_skip: true to let users skip if they have seen it before.
  • Validation: Quality control rules attached to items that check if annotations match expected criteria (score ranges, error span locations, etc.). Used for tutorials and attention checks.
  • Model: The system or model that generated the output being evaluated (e.g., "GPT-4", "Claude"). Used for tracking and ranking model performance.
  • Dashboard: The management interface that shows campaign progress, annotator statistics, access links, and allows downloading annotations. Accessed via a special management URL with token authentication.
  • Protocol: The annotation scheme defining what data is collected:
    • Score: Numeric quality rating (0-100)
    • Error Spans: Text highlights marking errors with severity (minor, major)
    • Error Categories: MQM taxonomy labels for errors
  • Template: The annotation interface type. The basic template supports comparing multiple outputs simultaneously.
  • Assignment: The method for distributing items to users:
    • Task-based: Each user has predefined items
    • Single-stream: Users draw from a shared pool with random assignment
    • Dynamic: Work in progress

Development

The server responds to data-only requests from the frontend (no template coupling). The frontend is served from the pre-built static/ directory on install.

Local development:

cd pearmut
# Frontend (run in a separate terminal)
npm install --prefix web/
npm run build --prefix web/
# optionally keep this running to auto-rebuild on changes
npm run watch --prefix web/

# Install as editable
pip3 install -e .
# Load examples
pearmut add examples/wmt25_#_en-cs_CZ.json examples/wmt25_#_cs-de_DE.json
pearmut run

Creating new protocols:

  • Add HTML and TS files to web/src
  • Add build rule to webpack.config.js
  • Reference as info->template in campaign JSON

See web/src/basic.ts for example.
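
The new protocol is then selected per campaign via the template key in info, e.g. (my_protocol is a hypothetical name that must match the entry added to webpack.config.js):

{
  "info": {
    ...
    "template": "my_protocol"
  },
  ...
}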

Deployment

Run Pearmut on a public server, or run it locally and tunnel the local port to a public IP/domain.
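
One possible setup is SSH remote port forwarding from a public host to the local server (the host name is a placeholder; any equivalent tunnel works). Pass the public URL via --server so that the generated annotation links point to it:

# forward port 8001 on the public host to the local server
# (the remote sshd must allow GatewayPorts for external access)
ssh -N -R 8001:localhost:8001 user@annotate.example.org

# add campaigns with the public URL prefix, then run locally
pearmut add my_campaign.json --server http://annotate.example.org:8001
pearmut run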

Misc.

If you use this work in your paper, please cite it as follows.

@misc{zouhar2025pearmut,
    author={Vilém Zouhar},
    title={Pearmut: Platform for Evaluating and Reviewing of Multilingual Tasks},
    url={https://github.com/zouharvi/pearmut/},
    year={2026},
}

Contributions are welcome! Please reach out to Vilém Zouhar.
