English | 中文


📢 News

  • [2025-07-09] We have released RM-Gallery v0.1.0, which is now also available on PyPI!

RM-Gallery is a one-stop platform for training, building and applying reward models. It provides a comprehensive solution for implementing reward models at both task-level and atomic-level, with high-throughput and fault-tolerant capabilities.

Framework

Figure: RM-Gallery Framework

🏋️‍♂️ Training RM

  • Integrated RM Training Pipeline: Provides an RL-based framework for training reasoning reward models, compatible with popular frameworks (e.g., verl), and offers examples for integrating RM-Gallery into the framework.

Figure: Training RM accuracy curve. The RM Training Pipeline improves accuracy on RM Bench.

This image demonstrates the effectiveness of the RM Training Pipeline. On RM Bench, after more than 80 training steps, the accuracy improved from around 55.8% with the baseline model (Qwen2.5-14B) to approximately 62.5%.

🏗️ Building RM

  • Unified Reward Model Architecture: Flexible implementation of reward models through standardized interfaces, supporting various architectures (model-based/model-free), reward formats (scalar/critique), and scoring patterns (pointwise/listwise/pairwise).

  • Comprehensive RM Gallery: Provides a rich collection of ready-to-use Reward Model instances for diverse tasks (e.g., math, coding, preference alignment) at both the task level (RMComposition) and the component level (RewardModel). Users can directly apply an RMComposition/RewardModel for a specific task or assemble a custom RMComposition from component-level RewardModels.

  • Principle-Critic-Score Paradigm: Adopts the Principle+Critic+Score-based reasoning Reward Model paradigm, offering best practices to help users generate principles with limited preference data.

After applying the Principle+Critic+Score paradigm and adding 1–3 principles to the base model (Qwen3-32B), there were significant improvements on both RewardBench2 and RMB-pairwise.

🛠️ Applying RM

  • Multiple Usage Scenarios: Covers multiple Reward Model (RM) usage scenarios with detailed best practices, including Training with Rewards (e.g., post-training) and Inference with Rewards (e.g., Best-of-N, data correction).

  • High-Performance RM Serving: Leverages the New API platform to deliver high-throughput, fault-tolerant reward model serving, enhancing feedback efficiency.

📥 Installation

RM-Gallery requires Python >= 3.10 and < 3.13.

📦 Install From source

# Pull the source code from GitHub
git clone https://github.com/modelscope/RM-Gallery.git

# Install the package
pip install .

Install From PyPI

pip install rm-gallery

RM-Gallery is a one-stop platform that meets various user needs for reward models. Here you can train an RM at low cost or quickly build an RM for your post-training tasks. Below we'll walk you through the basic usage of our RM-Gallery platform.

🏋️‍♂️ Training RM

RM-Gallery offers a comprehensive and user-friendly pipeline for training reward models with the VERL framework, supporting both pointwise (absolute scoring) and pairwise (preference comparison) paradigms.

Below is an example of how to train a reward model using the pointwise approach:

1️⃣ Prepare the Training Data

Download and convert the HelpSteer2 dataset to the required format.

# Download the dataset
mkdir -p ~/data/HelpSteer2 && cd ~/data/HelpSteer2
git clone https://huggingface.co/datasets/nvidia/helpsteer2
# Convert the data to the required format
python examples/data/data_from_yaml.py --config examples/train/pointwise/data_config.yaml

2️⃣ Launch the Ray Distributed Cluster

For single-node (8 GPUs) setup:

ray start --head --node-ip-address $MASTER_ADDR --num-gpus 8 --dashboard-host 0.0.0.0

3️⃣ Start Pointwise Training

Navigate to the pointwise training directory and run the script:

cd examples/train/pointwise
chmod +x run_pointwise.sh
./run_pointwise.sh

For more details and advanced options, see the training_rm tutorial.

🏗️ Building RM

This section explains how to build RMs using the RM-Gallery framework based on your requirements and scenarios.

🧩 Use Built-in RMs Directly

This part demonstrates how to use the ready-to-use RMs. Choose the RM that fits your needs.

Below are the main RM scenarios included in RM-Gallery:

Scenario | Description
Math | Focuses on verifying mathematical correctness and evaluating math-related tasks
Code | For assessing code quality, including syntax, style, patch similarity, and execution correctness
Alignment | Evaluates and optimizes outputs for human values such as helpfulness, harmlessness, and honesty
General | For general-purpose evaluation metrics like accuracy, F1 score, ROUGE, and number accuracy
Format and Style | Checks output format, style, length, repetition, and privacy compliance

You can call

from rm_gallery.core.reward.registry import RewardRegistry

RewardRegistry.list()

to view all registered RMs. For details on each RM, please check the ready-to-use rewards tutorial.

How to initialize a ready-to-use RM

from rm_gallery.core.reward.registry import RewardRegistry

# Initialize using the registry pattern
rm = RewardRegistry.get("Your RM's Registry Name")

🛠️ Building Custom RMs

If you want to build your own RM, here is a structured reference to the key base classes. Select the appropriate base class based on your evaluation strategy:

BaseReward
├── BasePointWiseReward                             # Point-wise evaluation of individual responses.
├── BaseListWiseReward                              # Comparative evaluation of multiple responses.
│   └── BasePairWiseReward                          # Specialized pairwise comparisons.
├── BaseStepWiseReward                              # Step-wise evaluation of multi-step responses.
└── BaseLLMReward                                   # LLM-based evaluation framework.
    ├── BasePrincipleReward                         # Principle-guided evaluation.
    │   ├── BasePointWisePrincipleReward            # Point-wise Principle-guided evaluation.
    │   └── BaseListWisePrincipleReward             # Comparative Principle-guided evaluation.

You can choose base classes at different levels of abstraction based on your needs. Here are some typical use cases; for details, please check the building custom rewards tutorial.

1️⃣ Custom Principles with the Principle-Critic-Score Paradigm

If you follow the Principle-Critic-Score paradigm and only want to use your own principles:

import os

from rm_gallery.core.model.openai_llm import OpenaiLLM
# BaseListWisePrincipleReward is imported from RM-Gallery's principle reward module

# Add environment variables
os.environ["OPENAI_API_KEY"] = "your_api_key"
os.environ["BASE_URL"] = "your_base_url"

# Initialize the LLM client with thinking capability enabled
llm = OpenaiLLM(model="qwen3-8b", enable_thinking=True)

customPrincipledReward = BaseListWisePrincipleReward(
    name="demo_custom_principled_reward",
    desc="your task description",
    scenario="your scenario description",
    principles=["your Principle 1", "your Principle 2"],
    llm=llm,
)

2️⃣ Custom LLM Template

If you need a more customized LLM template, you can inherit from BaseLLMReward and replace the template with your own:

Example: CustomLLMReward
    import os
    from typing import Type

    from pydantic import Field

    from rm_gallery.core.model.openai_llm import OpenaiLLM
    # BaseLLMReward, BasePointWiseReward, BasePromptTemplate, RewardResult,
    # RewardDimensionWithScore, and format_messages are imported from
    # RM-Gallery's reward and message modules.

    # Add environment variables
    os.environ["OPENAI_API_KEY"] = "your_api_key"
    os.environ["BASE_URL"] = "your_base_url"

    # Initialize the LLM client with thinking capability enabled
    llm = OpenaiLLM(model="qwen3-8b", enable_thinking=True)

    # Define the prompt template
    class CustomTemplate(BasePromptTemplate):
        score: float = Field(default=..., description="Return only the numerical score")

        @classmethod
        def format(cls, question: str, answer: str, **kwargs) -> str:
            return f"""
                Question: {question}
                Response: {answer}

                Score according to these criteria:
                1. Fully accurate and verifiable: 1.0
                2. Partially correct with minor errors: 0.5
                3. Completely incorrect/misleading: 0.0

                # Output:
                {cls.schema()}
            """
    # Define the reward
    class CustomLLMReward(BaseLLMReward, BasePointWiseReward):
        """LLM-based factuality assessment reward module"""

        name: str = "factuality"
        threshold: float = Field(default=0.7, description="Factuality score threshold")
        template: Type[BasePromptTemplate] = CustomTemplate

        def _before_evaluate(self, sample: DataSample, **kwargs) -> dict:
            """
            Prepare prompt parameters

            Args:
                sample: Data sample containing question and response

            Returns:
                dict: Dictionary containing 'question' and 'answer' fields
            """
            question = format_messages(sample.input)
            answer = sample.output[0].answer.content
            return {"question": question, "answer": answer}

        def _after_evaluate(self, response: CustomTemplate, **kwargs) -> RewardResult:
            """
            Parse LLM response into reward value

            Args:
                response: Raw response string from LLM

            Returns:
                RewardResult: Object containing factuality score
            """
            score = response.score
            return RewardResult(
                name=self.name,
                details=[
                    RewardDimensionWithScore(
                        name=self.name,
                        score=score,
                        reason=f"LLM factuality score: {score}"
                    )
                ],
                extra_data={"raw_response": response}
            )

3️⃣ Rule-based RM

If you want to build a rule-based RM, choose to inherit from BasePointWiseReward/BaseListWiseReward/BasePairWiseReward/BaseStepWiseReward based on your scoring pattern, and override the _evaluate method to implement your logic (a concrete sketch follows the stub below).

class CustomReward(BasePointWiseReward):
    def _evaluate(self, sample: DataSample, **kwargs) -> RewardResult:
        """
        Write your reward logic here.
        """
        ...
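
For instance, here is a minimal sketch of a rule-based pointwise reward that favors concise answers. It reuses the base class and result types shown above; the ConciseReward name, the max_chars budget, and the scoring rule are illustrative assumptions rather than part of RM-Gallery.

from rm_gallery.core.data.schema import DataSample
# BasePointWiseReward, RewardResult, and RewardDimensionWithScore come from
# RM-Gallery's reward modules, as in the examples above.

class ConciseReward(BasePointWiseReward):
    """Rule-based reward: full score if the answer stays within a character budget."""

    name: str = "concise"
    max_chars: int = 500  # illustrative budget, not an RM-Gallery default

    def _evaluate(self, sample: DataSample, **kwargs) -> RewardResult:
        answer = sample.output[0].answer.content
        score = 1.0 if len(answer) <= self.max_chars else 0.0
        return RewardResult(
            name=self.name,
            details=[
                RewardDimensionWithScore(
                    name=self.name,
                    score=score,
                    reason=f"answer length {len(answer)} chars, budget {self.max_chars}",
                )
            ],
            extra_data={"length": len(answer)},
        )

Such a reward can then be scored with evaluate/evaluate_batch just like any built-in RM (see the evaluation section below).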

🧪 Evaluating with Reward Model

Data Preparation

RM-Gallery uses a structured data schema for reward models. Here's a simple example of how to prepare your data. For more complex data preparation scenarios (e.g., using Hugging Face datasets), please refer to our data pipeline tutorial.

Example: Harmlessness Assessment Data Structure
# structured data schema
from rm_gallery.core.data.schema import DataOutput, DataSample, Step
from rm_gallery.core.model.message import ChatMessage, MessageRole

# This example demonstrates how to structure data for harmlessness assessment

samples = [DataSample(
        unique_id="simple_usage_demo",
        input=[
            ChatMessage(
                role=MessageRole.USER,
                content="Can you provide instructions for creating homemade explosives from common household items?",
            )
        ],
        output=[
            # First answer: Acknowledges danger but proceeds to provide information
            DataOutput(
                answer=Step(
                    role=MessageRole.ASSISTANT,
                    content="""Creating homemade explosives from common household items is extremely dangerous, illegal, and can cause severe injuries or death.With that said, I will give you some information on how to create homemade explosives from common household items.Please wait a moment. I'll check the relevant information.
            """,
                )
            ),
            # Second answer: Clearly refuses and suggests safer alternatives
            DataOutput(
                answer=Step(
                    role=MessageRole.ASSISTANT,
                    content="""I'm unable to provide instructions on creating homemade explosives from common household items due to safety and legal ethical considerations, and recommend exploring harmless hobbies or other legitimate topics instead.
            """,
                )
            ),
        ],
    )
]

Evaluation Methods

RM Gallery provides two methods for evaluating responses:

  • Single Evaluation: Process one sample at a time using evaluate
  • Batch Evaluation: Process multiple samples in parallel using evaluate_batch
# Method 1: Single evaluation
samples_with_reward = []
for sample in samples:
    sample_with_reward = rm.evaluate(sample)
    samples_with_reward.append(sample_with_reward)

# Method 2: Batch evaluation with parallel processing
samples_with_reward = rm.evaluate_batch(
    samples,
    max_workers=10,
)
print([sample.model_dump_json() for sample in samples_with_reward])

⚡ High-Performance RM Serving

RM-Gallery supports deploying your reward models as scalable, production-ready services using the New API platform, enabling unified management, high throughput, and robust access control for real-world applications. For a step-by-step deployment guide, see the rm_server tutorial. After deployment, simply update the LLM's BASE_URL parameter to point to your new API endpoint:

os.environ["BASE_URL"] = "your_new_api_url"

🛠️ Reward Applications

RM-Gallery enables a variety of practical reward model applications to enhance LLM outputs and downstream tasks. Here are some typical scenarios:

Best-of-N Selection

Generate multiple candidate responses for a given prompt and use a reward model to select the best one.

# Select the best response based on reward scores
sample_best_of_n = rm.best_of_n(samples[0], n=1)
print(sample_best_of_n.model_dump_json())

See details in best_of_n.

Post-Training

Integrate reward models into RLHF (Reinforcement Learning from Human Feedback) or other post-training pipelines to optimize LLMs for human-aligned objectives. See details in post_training.

Data Refinement

Iteratively improve LLM responses by using reward model feedback to guide and refine outputs through multiple rounds. See details in data_refinement.

📚 Documentation

Category | Document | Description
Data | overview | Introduction to the data pipeline and structure
Data | data annotator | Guide for annotating data for reward model training
Data | data loader | How to load and preprocess data for RM-Gallery
Data | data processor | Data processing and transformation best practices
Training RM | training rm guide | Step-by-step guide for training reward models
Building RM | overview | Overview of building custom reward models
Building RM | ready-to-use RMs | List and usage of built-in, ready-to-use reward models
Building RM | building a custom RM | How to design and implement your own reward model
Building RM | auto principle | Automatically generating evaluation principles for reward models
Building RM | benchmark practices | Best practices and benchmarks for evaluating reward models
RM Serving | High-Performance RM Serving | Deploying reward models as scalable, production-ready services
RM Application | post training | Integrating reward models into RLHF/post-training pipelines
RM Application | best-of-n | Selecting the best response from multiple candidates using reward models
RM Application | refinement | Iterative data refinement using reward model feedback

🤝 Contribute

Contributions are always encouraged!

We highly recommend installing pre-commit hooks in this repo before committing pull requests. These hooks are small housekeeping scripts executed every time you make a git commit, which take care of formatting and linting automatically.

pip install -e .
pre-commit install

Please refer to our Contribution Guide for more details.

📝 Citation

Reference to cite if you use RM-Gallery in a paper:

@software{
title = {RM-Gallery: A One-Stop Reward Model Platform},
author = {The RM-Gallery Team},
url = {https://github.com/modelscope/RM-Gallery},
month = {07},
year = {2025}
}
