
OpenRLHF is the first easy-to-use, high-performance open-source RLHF framework built on Ray, vLLM, ZeRO-3 and HuggingFace Transformers, designed to make RLHF training simple and accessible:
More details are in Slides | Technical Report | Documents
- Asynchronous RLHF (--async_train) and Async Agent RLHF (--agent_func_path) with a redesigned class-based Agent API.
- Hybrid Engine that colocates all models (--colocate_all_models, --vllm_enable_sleep and --vllm_gpu_memory_utilization 0.5).
- Distributed RLHF with vLLM generation engines (--vllm_num_engines).
- Dynamic filtering of samples by reward (--dynamic_filtering and --dynamic_filtering_reward_range).
- DeepSpeed tensor parallelism (--ds_tensor_parallel_size).
- Ring Attention for long-context training (--ring_attn_size, --ring_head_stride).
- Sample packing (--packing_samples).
- MoE auxiliary loss (--aux_loss_coef).
- FlashAttention2 (--flash_attn).
- QLoRA (--load_in_4bit) and LoRA (--lora_rank, --target_modules).
- HuggingFace tokenizer.apply_chat_template for datasets (--apply_chat_template and --input_key).
- Logging with Wandb (--use_wandb) and TensorBoard (--use_tensorboard).
- Checkpoint saving and recovery (--load_checkpoint and --save_steps).

To use OpenRLHF, first launch the docker container (Recommended) and pip install openrlhf inside the docker container:
# Launch the docker container
docker run --runtime=nvidia -it --rm --shm-size="10g" --cap-add=SYS_ADMIN -v $PWD:/openrlhf nvcr.io/nvidia/pytorch:25.02-py3 bash
sudo pip uninstall xgboost transformer_engine flash_attn pynvml -y
# pip install
pip install openrlhf
# If you want to use vLLM acceleration (Install vLLM 0.10.0)
pip install openrlhf[vllm]
# latest vLLM is also supported
pip install openrlhf[vllm_latest]
# Install vLLM, ring-flash-attention and Liger-Kernel
pip install openrlhf[vllm,ring,liger]
# pip install the latest version
pip install git+https://github.com/OpenRLHF/OpenRLHF.git
# Or git clone
git clone https://github.com/OpenRLHF/OpenRLHF.git
cd OpenRLHF
pip install -e .
[!NOTE] We recommend using vLLM 0.10.0 or higher. We also provide Dockerfiles for vLLM and a One-Click Installation Script for Nvidia-Docker.
OpenRLHF provides multiple data processing methods in its dataset classes, for example in the Prompt Dataset:
def preprocess_data(data, input_template=None, input_key="input", apply_chat_template=None) -> str:
    if apply_chat_template:
        chat = data[input_key]
        if isinstance(chat, str):
            chat = [{"role": "user", "content": chat}]
        prompt = apply_chat_template(chat, tokenize=False, add_generation_prompt=True)
    else:
        prompt = data[input_key]
        if input_template:
            prompt = input_template.format(prompt)
    return prompt
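As a quick illustration of the two code paths above, here is a hedged sketch; the sample record and template string are invented for this example, only preprocess_data itself comes from OpenRLHF:

```python
# Illustrative only: the record and template below are made up for this sketch.
sample = {"input": "What is the capital of France?"}

# Path 1: no chat template -- the raw prompt is wrapped by --input_template.
print(preprocess_data(sample, input_template="User: {}\nAssistant: "))
# -> User: What is the capital of France?
#    Assistant:

# Path 2: --apply_chat_template -- the tokenizer's own template is used instead, e.g.
# preprocess_data(sample, apply_chat_template=tokenizer.apply_chat_template)
```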
- Use --input_key to specify the JSON key name of the input datasets --prompt_data {name or path} (PPO) or --dataset {name or path}, and use --apply_chat_template to utilize the chat_template from the Huggingface Tokenizer.
- If you do not want to use --apply_chat_template, you can use --input_template instead, or preprocess the datasets offline in advance.
- OpenRLHF also supports mixing multiple datasets using --prompt_data_probs 0.1,0.4,0.5 (PPO) or --dataset_probs 0.1,0.4,0.5.

How Chat Templating Works:
dataset = [{"input_key": [
    {"role": "user", "content": "Hello, how are you?"},
    {"role": "assistant", "content": "I'm doing great. How can I help you today?"},
    {"role": "user", "content": "I'd like to show off how chat templating works!"},
]}]

tokenizer.apply_chat_template(dataset[0]["input_key"], tokenize=False)

"<s>[INST] Hello, how are you? [/INST]I'm doing great. How can I help you today?</s> [INST] I'd like to show off how chat templating works! [/INST]"
How to specify test datasets?
Please set the test dataset path using --eval_dataset {name or path}.
[!NOTE] The JSON key options depend on the specific datasets. See Reward Dataset and SFT Dataset.
OpenRLHF's model checkpoint is fully compatible with HuggingFace models. You can specify the model name or path using --pretrain {name or path}
, --reward_pretrain {name or path}
and --critic_pretrain {name or path}
. We have provided some pre-trained checkpoints and datasets on HuggingFace OpenRLHF.
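Because these checkpoints are standard HuggingFace models, they can also be loaded directly with transformers. A minimal sketch; the repo id is one of the checkpoints used in the examples below, and the dtype/device settings are assumptions:

```python
# Minimal sketch: OpenRLHF checkpoints load like any HuggingFace causal LM.
# torch_dtype and device_map here are illustrative choices.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "OpenRLHF/Llama-3-8b-sft-mixture"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name, torch_dtype=torch.bfloat16, device_map="auto")
```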
Then you can use the startup scripts we provide in the examples/scripts directory, or start the training using the following commands.
deepspeed --module openrlhf.cli.train_sft \
--max_len 4096 \
--dataset Open-Orca/OpenOrca \
--input_key question \
--output_key response \
--input_template $'User: {}\nAssistant: ' \
--train_batch_size 256 \
--micro_train_batch_size 2 \
--max_samples 500000 \
--pretrain meta-llama/Meta-Llama-3-8B \
--save_path ./checkpoint/llama3-8b-sft \
--save_steps -1 \
--logging_steps 1 \
--eval_steps -1 \
--zero_stage 2 \
--max_epochs 1 \
--packing_samples \
--bf16 \
--flash_attn \
--learning_rate 5e-6 \
--gradient_checkpointing \
--use_wandb {wandb_token}
# Support HF tokenizer.apply_chat_template
# --apply_chat_template
# --tokenizer_chat_template {HF Chat Template}
# Support RingAttention
# pip install ring_flash_attn
# --ring_attn_size 2 \
# --ring_head_stride 2 \
# Multi-turn fine-tuning loss
# --multiturn
# Can also be used for continued pre-training
# --pretrain_mode
[!NOTE] OpenRLHF SFT/DPO/RewardModel/PPO trainers support --packing_samples based on --flash_attn.
deepspeed --module openrlhf.cli.train_rm \
--save_path ./checkpoint/llama3-8b-rm \
--save_steps -1 \
--logging_steps 1 \
--eval_steps -1 \
--train_batch_size 256 \
--micro_train_batch_size 1 \
--pretrain OpenRLHF/Llama-3-8b-sft-mixture \
--bf16 \
--max_epochs 1 \
--max_len 8192 \
--zero_stage 3 \
--learning_rate 9e-6 \
--dataset OpenRLHF/preference_dataset_mixture2_and_safe_pku \
--apply_chat_template \
--chosen_key chosen \
--rejected_key rejected \
--flash_attn \
--packing_samples \
--gradient_checkpointing \
--use_wandb {wandb_token}
It is recommended to set the --value_prefix_head option of the Reward Model to score, so that we can load the model using AutoModelForSequenceClassification:
import torch
from transformers import AutoModelForSequenceClassification

reward_model = AutoModelForSequenceClassification.from_pretrained(
    reward_model_path,
    num_labels=1,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    use_cache=False,
)
inputs = ...  # left-padded input tokens
reward = reward_model.model(**inputs).last_hidden_state
reward = reward_model.score(reward)[:, -1]
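For completeness, a hedged sketch of the left-padded tokenization step elided above, continuing the snippet; the sample text is a placeholder and the pad-token handling is an assumption that may need adjusting for your model:

```python
# Sketch of the "left-padded input tokens" step: the score is read from the last
# position, so the batch must be left-padded. The text below is a placeholder.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(reward_model_path)
tokenizer.padding_side = "left"
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token  # assumption: reuse EOS as pad

texts = ["User: Hello\nAssistant: Hi, how can I help you today?"]
inputs = tokenizer(texts, return_tensors="pt", padding=True)

hidden = reward_model.model(**inputs).last_hidden_state
reward = reward_model.score(hidden)[:, -1]
```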
To improve RLHF training speed or support 70B models, we can use PPO with Ray and vLLM acceleration (Hybrid Engine):
# launch the master node of ray in container
ray start --head --node-ip-address 0.0.0.0 --num-gpus 8
# if you want to launch ray on more nodes, use
ray start --address {MASTER-NODE-ADDRESS}:6379 --num-gpus 8
ray job submit --address="http://127.0.0.1:8265" \
--runtime-env-json='{"working_dir": "/openrlhf"}' \
-- python3 -m openrlhf.cli.train_ppo_ray \
--ref_num_nodes 1 \
--ref_num_gpus_per_node 8 \
--reward_num_nodes 1 \
--reward_num_gpus_per_node 8 \
--critic_num_nodes 1 \
--critic_num_gpus_per_node 8 \
--actor_num_nodes 1 \
--actor_num_gpus_per_node 8 \
--vllm_num_engines 4 \
--vllm_tensor_parallel_size 2 \
--colocate_all_models \
--vllm_gpu_memory_utilization 0.5 \
--pretrain OpenRLHF/Llama-3-8b-sft-mixture \
--reward_pretrain OpenRLHF/Llama-3-8b-rm-700k \
--save_path /openrlhf/examples/test_scripts/final/llama3-8b-rlhf \
--ckpt_path /openrlhf/examples/test_scripts/ckpt/llama3-8b-rlhf \
--save_hf_ckpt \
--micro_train_batch_size 8 \
--train_batch_size 128 \
--micro_rollout_batch_size 16 \
--rollout_batch_size 1024 \
--n_samples_per_prompt 1 \
--max_epochs 1 \
--prompt_max_len 1024 \
--max_samples 100000 \
--generate_max_len 1024 \
--zero_stage 3 \
--bf16 \
--actor_learning_rate 5e-7 \
--critic_learning_rate 9e-6 \
--init_kl_coef 0.01 \
--prompt_data OpenRLHF/prompt-collection-v0.1 \
--input_key context_messages \
--apply_chat_template \
--normalize_reward \
--gradient_checkpointing \
--packing_samples \
--vllm_sync_backend nccl \
--enforce_eager \
--vllm_enable_sleep \
--deepspeed_enable_sleep \
--use_wandb {wandb_token}
# Support REINFORCE++ | RLOO | REINFORCE++-baseline | GRPO | Dr. GRPO
# --advantage_estimator reinforce | rloo | reinforce_baseline | group_norm | dr_grpo
# Set --init_kl_coef to 0 will not launch the reference model
# Support remote reward model (HTTP)
# --remote_rm_url http://localhost:5000/get_reward
# Support N samples
# --n_samples_per_prompt 4
[!NOTE] You can also use setup_commands to let Ray automatically deploy the environment, for example --runtime-env-json='{"setup_commands": ["pip install openrlhf[vllm]"]}'.
[!NOTE] RLOO and REINFORCE++-baseline in OpenRLHF are modifications of REINFORCE++:
- REINFORCE++ integrates key optimization techniques from PPO (such as advantage normalization and the PPO-clip loss) into REINFORCE while eliminating the need for a critic network.
- REINFORCE++-baseline uses the mean reward of multiple samples from the same prompt as the baseline to reshape the rewards, thereby filtering out responses that are either entirely correct or entirely incorrect, and then applies the global advantage normalization of REINFORCE++.
- RLOO in OpenRLHF modifies the original version by incorporating the per-token KL reward and utilizing the PPO-clip loss.
- Dr. GRPO removes the local group normalization /std in GRPO.
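To make the reward shaping concrete, here is a minimal sketch of the advantage computations described above. It is illustrative only, not OpenRLHF's actual implementation; the tensor shape and epsilon handling are assumptions:

```python
# Illustrative sketch of the shaping described above, not OpenRLHF's code.
# rewards: (num_prompts, n_samples_per_prompt), one scalar reward per response.
import torch

def reinforce_pp_baseline_advantages(rewards: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    # Baseline = mean reward of the samples drawn from the same prompt ...
    shaped = rewards - rewards.mean(dim=1, keepdim=True)
    # ... followed by REINFORCE++'s global advantage normalization over the whole batch.
    return (shaped - shaped.mean()) / (shaped.std() + eps)

def grpo_advantages(rewards: torch.Tensor, use_std: bool = True, eps: float = 1e-8) -> torch.Tensor:
    # GRPO normalizes within each prompt group; Dr. GRPO drops the /std term.
    centered = rewards - rewards.mean(dim=1, keepdim=True)
    return centered / (rewards.std(dim=1, keepdim=True) + eps) if use_std else centered
```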
[!NOTE] If you encounter an error related to index out of range when DeepSpeed sets up the GPU devices, you can try setting the environment variable RAY_EXPERIMENTAL_NOSET_*_VISIBLE_DEVICES as a workaround.
# For NVIDIA GPUs: export RAY_EXPERIMENTAL_NOSET_CUDA_VISIBLE_DEVICES=1
The launch scripts and documentation for the supported algorithms are in examples/scripts and Documents - Usage.
OpenRLHF supports convenient and efficient Reinforced Fine-tuning. You only need to implement a file containing the custom reward_func
function and pass its path to the remote_rm_url
parameter. For example:
# reward_func.py
import torch

def reward_func(queries, prompts, labels):
    # queries is prompts + responses
    # labels is answers
    print(queries)

    # Generate random rewards as an example
    # In real applications, this should be replaced with actual reward calculation logic
    reward = torch.randint(0, 2, (len(queries),)).float()

    return {
        "rewards": reward,  # Rewards for advantage calculation
        "scores": reward,  # Scores for dynamic filtering (0-1 reward)
        "extra_logs": {"dummy_scores": reward},  # Additional logging info for wandb
    }
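For a slightly less artificial example, assuming the prompt dataset carries a ground-truth answer passed in via --label_key, here is a hedged sketch that rewards responses containing the answer; the substring check is only a stand-in for a real verifier:

```python
# reward_func.py -- illustrative variant, not part of OpenRLHF.
# Rewards 1.0 when the label (passed via --label_key) appears in the generated
# text, else 0.0.
import torch

def reward_func(queries, prompts, labels):
    reward = torch.tensor(
        [1.0 if label and str(label).strip() in query else 0.0
         for query, label in zip(queries, labels)]
    )
    return {
        "rewards": reward,                       # Rewards for advantage calculation
        "scores": reward,                        # 0-1 scores for dynamic filtering
        "extra_logs": {"answer_match": reward},  # Additional logging info for wandb
    }
```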
Then just set:
ray job submit --address="http://127.0.0.1:8265" \
--runtime-env-json='{"working_dir": "/openrlhf"}' \
-- python3 -m openrlhf.cli.train_ppo_ray \
...
--remote_rm_url /path/to/reward_func.py \
--label_key answer
where the label_key
parameter is used to pass additional sample information, such as the answer, to the reward function.
OpenRLHF provides comprehensive support for both Asynchronous RLHF and Agent-based RLHF implementations. To utilize these features, simply include the --async_train
and --agent_func_path
parameters in your training configuration.
The Agent API has been redesigned to use a class-based approach with AgentInstanceBase
and AgentExecutorBase
classes for better modularity and extensibility.
# agent_func.py
import random
from typing import Any, Dict

import torch
from openrlhf.utils.agent import AgentExecutorBase, AgentInstanceBase


# A simple n-step random environment
class AgentInstance(AgentInstanceBase):
    async def __init__(self, *args, **kwargs):
        self.step_idx = 0
        self.max_steps = random.randint(1, 3)  # 1-3 steps

    async def reset(self, states: dict, **kwargs):
        return {"observation": states["observation"]}  # Return original text observation

    async def step(self, states: dict, **kwargs) -> Dict[str, Any]:
        print(f"step_idx: {self.step_idx}, max_steps: {self.max_steps}")

        observation_text = states["observation_text"]
        action_text = states["action_text"]
        label = states["label"]

        # Check if episode is done
        done = self.step_idx >= self.max_steps
        reward = torch.randint(0, 2, (1,)).float() if done else torch.tensor(0)

        # Generate environment feedback based on whether episode is done
        environment_feedback = (
            "\n\nHuman: [CORRECT]\n</s>"
            if done
            else "\n\nHuman: [INCORRECT]\nPlease analyze the issues and try again.\n</s>\n\nAssistant: "
        )

        self.step_idx += 1

        return {
            "rewards": reward,  # Rewards for advantage calculation
            "scores": reward,  # Scores for dynamic filtering (0-1 reward)
            "environment_feedback": environment_feedback,  # Environment feedback text
            "done": done,  # Boolean indicating if the episode is complete
            "sampling_params": states.get("sampling_params", None),  # Parameters for vLLM sampling in next step
            "extra_logs": {"dummy_scores": reward},  # Additional logging information
        }


class AgentExecutor(AgentExecutorBase):
    def __init__(self, max_steps, max_length, llm_engine, hf_tokenizer, result_queue):
        super().__init__(AgentInstance, max_steps, max_length, llm_engine, hf_tokenizer, result_queue)

    async def execute(self, prompt, label, sampling_params):
        # You could override the execute function of AgentExecutorBase to add custom agent running logic
        return await super().execute(prompt, label, sampling_params)
You can also configure the maximum number of concurrent agents per vLLM engine by setting export OPENRLHF_ASYNC_NUM_TASKS=128
.
Additionally, you can control the degree of off-policy sampling by setting export OPENRLHF_ASYNC_QUEUE_SIZE=1
(this parameter controls how many batches of data can be stored in the buffer at most) in your environment.
[!NOTE] By overriding the execute function of AgentExecutorBase, you can implement completely custom agent running processes. The design follows the token-in-token-out principle to ensure consistency between sampling and training samples, avoiding potential mismatches that could occur with text-level processing.
[!NOTE] OpenRLHF's Agent RLHF also supports Hybrid Engine training. To enable this feature, please remove the --async_train flag and enable --colocate_all_models.
[!WARNING] Asynchronous training may affect the training stability. It is recommended to prioritize using Hybrid Engine or synchronous training mode.
If you use LoRA (Low-Rank Adaptation), OpenRLHF will not save the full model weights by default but only the LoRA adapter. To continue with your task normally, you should merge the adapter with the weights of your base model:
python -m openrlhf.cli.lora_combiner \
--model_path meta-llama/Meta-Llama-3-8B \
--lora_path ./checkpoint/llama3-8b-rm \
--output_path ./checkpoint/llama-3-8b-rm-combined \
--is_rm \
--bf16
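If the saved adapter is in standard PEFT format (whether it is depends on your training setup, and the openrlhf.cli.lora_combiner command above remains the supported path), the merge can also be sketched with the peft library. The paths below are hypothetical placeholders for an actor/SFT LoRA checkpoint:

```python
# Hedged sketch using peft's merge_and_unload, assuming a PEFT-format adapter
# for a causal LM. Reward models need the --is_rm handling of lora_combiner above.
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B", torch_dtype=torch.bfloat16)
merged = PeftModel.from_pretrained(base, "./checkpoint/llama3-8b-sft-lora").merge_and_unload()
merged.save_pretrained("./checkpoint/llama3-8b-sft-merged")
```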
To achieve optimal performance, we recommend allocating nodes vLLM:Actor:Critic = 1:1:1.

- Enable --async_train when the convergence of the RL algorithm meets requirements.
- Use the Hybrid Engine (--colocate_all_models, --vllm_enable_sleep and --deepspeed_enable_sleep) rather than distributed RLHF when there is enough GPU memory.
- Otherwise, enable the --colocate_critic_reward and --colocate_actor_ref options to merge nodes.
- Increase rollout_micro_batch_size (and minimize the TP size of the vLLM engine) as much as possible. During the training phase, a larger --micro_train_batch_size is better; also enable --packing_samples.
- Disable --adam_offload and enable --overlap_comm. Also enable --deepcompile to speed up the training.
- Use --vllm_sync_backend nccl.
- Enable --use_dynamic_batch to accelerate the DeepSpeed training and forward passes.
- Enable prefix caching in vLLM generation when n_samples_per_prompts > 1.
- For a large base model, if an OOM occurs, do not use the --colocate_xxxx options.

How to Join?
What can you do?
Your sponsorship can help us maintain and improve OpenRLHF. If you find this project useful, please consider sponsoring us. You can sponsor us on Open Collective.
A big thank you to all our contributors! If you want to contribute, feel free to make a pull request or create an issue.
We would like to express our gratitude to the following projects and organizations for their contributions to the field of AI and NLP:
Our project would also like to thank ColossalChat and DeepSpeedChat. In the early stages of the project, we referred to their code design. Our project would like to thank Netmind.AI for the GPU support of developing ring attention.
(2024/7) Our GitHub organization has changed from OpenLLMAI to OpenRLHF.
OpenRLHF
@article{hu2024openrlhf,
title={OpenRLHF: An Easy-to-use, Scalable and High-performance RLHF Framework},
author={Jian Hu and Xibin Wu and Zilin Zhu and Xianyu and Weixun Wang and Dehao Zhang and Yu Cao},
journal={arXiv preprint arXiv:2405.11143},
year={2024}
}
REINFORCE++-baseline
@article{hu2025reinforce++,
title={Reinforce++: A simple and efficient approach for aligning large language models},
author={Hu, Jian},
journal={arXiv preprint arXiv:2501.03262},
year={2025}
}