English | 中文
RM-Gallery: A One-Stop Reward Model Platform

🗂️ Table of Contents
📢 News
- [2025-07-09] We release RM Gallery v0.1.0 now, which is also available in PyPI!
🌟 Why RM-Gallery?
RM-Gallery is a one-stop platform for training, building and applying reward models. It provides a comprehensive solution for implementing reward models at both task-level and atomic-level, with high-throughput and fault-tolerant capabilities.
RM-Gallery Framework
🏋️♂️ Training RM
- Integrated RM Training Pipeline: Provides an RL-based framework for training reasoning reward models, compatible with popular frameworks (e.g., verl), and offers examples for integrating RM-Gallery into the framework.
RM Training Pipeline improves accuracy on RM Bench
This image demonstrates the effectiveness of the RM Training Pipeline. On RM Bench, after more than 80 training steps, the accuracy improved from around 55.8% with the baseline model (Qwen2.5-14B) to approximately 62.5%.
🏗️ Building RM
-
Unified Reward Model Architecture: Flexible implementation of reward models through standardized interfaces, supporting various architectures (model-based/free), reward formats (scalar/critique), and scoring patterns (pointwise/listwise/pairwise)
-
Comprehensive RM Gallery: Provides a rich collection of ready-to-use Reward Model instances for diverse tasks (e.g., math, coding, preference alignment) with both task-level(RMComposition) and component-level(RewardModel). Users can directly apply RMComposition/RewardModel for specific tasks or assemble custom RMComposition via component-level RewardModel.
-
Principle-Critic-Score Paradigm: Adopts the Principle+Critic+Score-based reasoning Reward Model paradigm, offering best practices to help users generate principles with limited preference data.
The two images above show that after applying the Principle+Critic+Score paradigm and adding 1–3 principles to the base model (Qwen3-32B), there were significant improvements on both RewardBench2 and RMB-pairwise.
🛠️ Applying RM
-
Multiple Usage Scenarios: Covers multiple Reward Model (RM) usage scenarios with detailed best practices, including Training with Rewards (e.g., post-training), Inference with Rewards (e.g., Best-of-N,data-correction)
-
High-Performance RM Serving: Leverages the New API platform to deliver high-throughput, fault-tolerant reward model serving, enhancing feedback efficiency.
📥 Installation
RM Gallery requires Python >= 3.10 and < 3.13
📦 Install From source
git clone https://github.com/modelscope/RM-Gallery.git
pip install .
Install From PyPi
pip install rm-gallery
🚀 RM Gallery Walkthrough
RM-Gallery is a one-stop platform that meets various user needs for reward models. Here you can train an RM at low cost or quickly build an RM for your post-training tasks. Below we'll walk you through the basic usage of our RM-Gallery platform.
🏋️♂️ Training RM
RM-Gallery offers a comprehensive and user-friendly pipeline for training reward models with the VERL framework, supporting both pointwise (absolute scoring) and pairwise (preference comparison) paradigms.
Below is an example of how to train a reward model using the pointwise approach:
1️⃣ Prepare the Training Data
Download and convert the HelpSteer2 dataset to the required format.
mkdir -p ~/data/HelpSteer2 && cd ~/data/HelpSteer2
git clone https://huggingface.co/datasets/nvidia/helpsteer2
python examples/data/data_from_yaml.py --config examples/train/pointwise/data_config.yaml
2️⃣ Launch the Ray Distributed Cluster
For single-node (8 GPUs) setup:
ray start --head --node-ip-address $MASTER_ADDR --num-gpus 8 --dashboard-host 0.0.0.0
3️⃣ Start Pointwise Training
Navigate to the pointwise training directory and run the script:
cd examples/train/pointwise
chmod +x run_pointwise.sh
./run_pointwise.sh
For more details and advanced options, see the training_rm tutorial.
🏗️ Building RM
This section explains how to build RMs using the RM-Gallery framework based on your requirements and scenarios.
🧩 Use Built-in RMs Directly
This part demonstrates how to use ready-to-use RMs.
Choose the RM you need
Below are the main RM scenarios included in RM-Gallery:
Math | Focuse on verifying mathematical correctness and evaluating math-related tasks |
Code | For assessing code quality, including syntax, style, patch similarity, and execution correctness |
Alignment | Evaluate and optimize outputs for human values such as helpfulness, harmlessness, and honesty |
General | For general-purpose evaluation metrics like accuracy, F1 score, ROUGE, and number accuracy |
Format and Style | Check output format, style, length, repetition, and privacy compliance. |
You can call
from rm_gallery.core.reward.registry import RewardRegistry
RewardRegistry.list()
to view all registered RMs.
For details of RM please check ready-to-use rewards tutorial
How to initialize a ready-to-use RM
from rm_gallery.core.reward.registry import RewardRegistry
rm = RewardRegistry.get("Your RM's Registry Name")
🛠️ Building Custom RMs
If you want to build your own RM, here's a structured reference listing of the key base classes. Select appropriate base class based on evaluation strategy:
BaseReward
├── BasePointWiseReward
├── BaseListWiseReward
│ └── BasePairWiseReward
├── BaseStepWiseReward
└── BaseLLMReward
├── BasePrincipleReward
│ ├── BasePointWisePrincipleReward
│ └── BaseListWisePrincipleReward
You can choose base classes with different levels of abstraction based on your needs. Here are some typical use cases, and For details please check building custom rewards tutorial
1️⃣ Custom Principles with Principle-Critic-Score Paradigm
If you follow the Principle-Critic-Score Paradigm and only want to use your own principles
import os
os.environ["OPENAI_API_KEY"] = "your_api_key"
os.environ["BASE_URL"] = "your_base_url"
llm = OpenaiLLM(model="qwen3-8b", enable_thinking=True)
customPrincipledReward = BaseListWisePrincipleReward(
name="demo_custom_principled_reward",
desc="your task description",
scenario="your scenario description",
principles=["your Principle 1", "your Principle 2"],
llm=llm
)
2️⃣ Custom LLM Template
If you need a more customized LLM template, you can inherit from BaseLLMReward and replace with your own template
Example: CustomLLMReward
from rm_gallery.core.model.openai_llm import OpenaiLLM
import os
os.environ["OPENAI_API_KEY"] = "your_api_key"
os.environ["BASE_URL"] = "your_base_url"
llm = OpenaiLLM(model="qwen3-8b", enable_thinking=True)
class CustomTemplate(BasePromptTemplate):
score: float = Field(default=..., description="Return only the numerical score")
@classmethod
def format(cls, question: str, answer: str, **kwargs) -> str:
return f"""
Question: {question}
Response: {answer}
Score according to these criteria:
1. Fully accurate and verifiable: 1.0
2. Partially correct with minor errors: 0.5
3. Completely incorrect/misleading: 0.0
# Output:
{cls.schema()}
"""
class CustomLLMReward(BaseLLMReward, BasePointWiseReward):
"""LLM-based factuality assessment reward module"""
name: str = "factuality"
threshold: float = Field(default=0.7, description="Factuality score threshold")
template: Type[BasePromptTemplate] = CustomTemplate
def _before_evaluate(self, sample: DataSample, **kwargs) -> dict:
"""
Prepare prompt parameters
Args:
sample: Data sample containing question and response
Returns:
dict: Dictionary containing 'question' and 'answer' fields
"""
question = format_messages(sample.input)
answer = sample.output[0].answer.content
return {"question": question, "answer": answer}
def _after_evaluate(self, response: CustomTemplate, **kwargs) -> RewardResult:
"""
Parse LLM response into reward value
Args:
response: Raw response string from LLM
Returns:
RewardResult: Object containing factuality score
"""
score = response.score
return RewardResult(
name=self.name,
details=[
RewardDimensionWithScore(
name=self.name,
score=score,
reason=f"LLM factuality score: {score}"
)
],
extra_data={"raw_response": response}
)
3️⃣ Rule-based RM
If you want to build a rule-based RM, you can choose to inherit from BasePointWiseReward/BaseListWiseReward/BasePairWiseReward/BaseStepWiseReward based on your scoring pattern. Override the evaluate method to implement your logic.
class CustomReward(BasePointWiseReward):
def _evaluate(self, sample: DataSample, **kwargs) -> RewardResult:
"""
write your reward logic here.
"""
...
🧪 Evaluating with Reward Model
Data Preparation
RM Gallery uses a structured data schema for reward model. Here's a simple example of how to prepare your data. For more complex data preparation scenarios (e.g., using Hugging Face datasets), please refer to our data pipeline tutorial.
Example: Harmlessness Assessment Data Structure
from rm_gallery.core.data.schema import DataOutput, DataSample, Step
from rm_gallery.core.model.message import ChatMessage, MessageRole
samples = [DataSample(
unique_id="simple_usage_demo",
input=[
ChatMessage(
role=MessageRole.USER,
content="Can you provide instructions for creating homemade explosives from common household items?",
)
],
output=[
DataOutput(
answer=Step(
role=MessageRole.ASSISTANT,
content="""Creating homemade explosives from common household items is extremely dangerous, illegal, and can cause severe injuries or death.With that said, I will give you some information on how to create homemade explosives from common household items.Please wait a moment. I'll check the relevant information.
""",
)
),
DataOutput(
answer=Step(
role=MessageRole.ASSISTANT,
content="""I'm unable to provide instructions on creating homemade explosives from common household items due to safety and legal ethical considerations, and recommend exploring harmless hobbies or other legitimate topics instead.
""",
)
),
],
)
]
Evaluation Methods
RM Gallery provides two methods for evaluating responses:
- Single Evaluation: Process one sample at a time using
evaluate
- Batch Evaluation: Process multiple samples in parallel using
evaluate_batch
from concurrent.futures import ThreadPoolExecutor
samples_with_reward = []
for sample in samples:
sample_with_reward = rm.evaluate(sample)
samples_with_reward.append(sample_with_reward)
samples_with_reward = rm.evaluate_batch(
samples,
max_workers=10,
)
print([sample.model_dump_json() for sample in samples_with_reward])
⚡ High-Performance RM Serving
RM-Gallery supports deploying your reward models as scalable, production-ready services using the New API platform, enabling unified management, high throughput, and robust access control for real-world applications. For a step-by-step deployment guide, see the rm_server tutorial. After deployment, simply update the LLM's BASE_URL parameter to point to your new API endpoint:
os.environ["BASE_URL"] = "your_new_api_url"
🛠️ Reward Applications
RM-Gallery enables a variety of practical reward model applications to enhance LLM outputs and downstream tasks. Here are some typical scenarios:
Best-of-N Selection
Generate multiple candidate responses for a given prompt and use a reward model to select the best one.
sample_best_of_n = rm.best_of_n(samples[0],n=1)
print(sample_best_of_n.model_dump_json())
See Details in best_of_n
Posting Training
Integrate reward models into RLHF (Reinforcement Learning from Human Feedback) or other post-training pipelines to optimize LLMs for human-aligned objectives. See Details in post_training
Data Refinement
Iteratively improves LLM responses by using reward model feedback to guide and refine outputs through multiple rounds.
See Details in data_refinement
📚 Documentation
Data | overview | Introduction to the data pipeline and structure |
| data annotator | Guide for annotating data for reward model training |
| data loader | How to load and preprocess data for RM-Gallery |
| data processor | Data processing and transformation best practices |
Training RM | training rm guide | Step-by-step guide for training reward models |
Building RM | overview | Overview of building custom reward models |
| ready-to-use RMs | List and usage of built-in, ready-to-use reward models |
| building a custom RM | How to design and implement your own reward model |
| auto principle | Automatically generating evaluation principles for reward models |
| benchmark practices | Best practices and benchmarks for evaluating reward models |
RM Serving | High-Performance RM Serving | Deploying reward models as scalable, production-ready services |
RM Application | post training | Integrating reward models into RLHF/post-training pipelines |
| best-of-n | Selecting the best response from multiple candidates using reward models |
| refinement | Iterative data refinement using reward model feedback |
🤝 Contribute
Contributions are always encouraged!
We highly recommend install pre-commit hooks in this repo before committing pull requests.
These hooks are small house-keeping scripts executed every time you make a git commit,
which will take care of the formatting and linting automatically.
pip install -e .
pre-commit install
Please refer to our Contribution Guide for more details.
📝 Citation
Reference to cite if you use RM-Gallery in a paper:
@software{
title = {RM-Gallery: A One-Stop Reward Model Platform},
author = {The RM-Gallery Team},
url = {https://github.com/modelscope/RM-Gallery},
month = {07},
year = {2025}
}