AgileRL
Reinforcement learning streamlined.
Easier and faster reinforcement learning with RLOps.
Join the Discord Server for questions, help and collaboration.
AgileRL is a Deep Reinforcement Learning library focused on improving development by introducing RLOps - MLOps for reinforcement learning.
This library is initially focused on reducing the time taken for training models and hyperparameter optimization (HPO) by pioneering evolutionary HPO techniques for reinforcement learning.
Evolutionary HPO has been shown to drastically reduce overall training times by automatically converging on optimal hyperparameters, without requiring numerous training runs.
We are constantly adding more algorithms and features. AgileRL already includes state-of-the-art evolvable on-policy, off-policy, offline, multi-agent and contextual multi-armed bandit reinforcement learning algorithms with distributed training.
AgileRL offers 10x faster hyperparameter optimization than SOTA.
Get Started
To see the full AgileRL documentation, including tutorials, visit our documentation site. To ask questions and get help, collaborate, or discuss anything related to reinforcement learning, join the AgileRL Discord Server.
Install as a package with pip:
pip install agilerl
Or install in development mode:
git clone https://github.com/AgileRL/AgileRL.git && cd AgileRL
pip install -e .
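To confirm the installation worked, a quick sanity check (not part of the official docs) is to import the package and print the installed version from Python:
# Verify that the package imports and report the installed version
import importlib.metadata
import agilerl  # raises ImportError if the install failed
print(importlib.metadata.version("agilerl"))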
Benchmarks
Reinforcement learning algorithms and libraries are usually benchmarked once the optimal hyperparameters for training are known, but it often takes hundreds or thousands of experiments to discover these. This is unrealistic and does not reflect the true, total time taken for training. What if we could remove the need to conduct all these prior experiments?
In the charts below, a single AgileRL run, which automatically tunes hyperparameters, is benchmarked against the multiple Optuna training runs traditionally required for hyperparameter optimization, demonstrating the real time savings possible. Global steps is the total number of environment steps taken by all agents, summed across the entire population.
AgileRL offers an order of magnitude speed up in hyperparameter optimization vs popular reinforcement learning training frameworks combined with Optuna. Remove the need for multiple training runs and save yourself hours.
AgileRL also supports multi-agent reinforcement learning using the PettingZoo parallel API. The charts below highlight the performance of our MADDPG and MATD3 algorithms with evolutionary hyperparameter optimization (HPO), benchmarked against epymarl's MADDPG algorithm with grid-search HPO on the Simple Speaker Listener and Simple Spread environments.
Tutorials
We are constantly updating our tutorials to showcase the latest features of AgileRL and how users can leverage our evolutionary HPO to achieve 10x faster hyperparameter optimization. Please see the available tutorials below.
Evolvable algorithms (more coming soon!)
Single-agent algorithms
Multi-agent algorithms
Contextual multi-armed bandit algorithms
Train an agent to beat a Gym environment
Before starting training, there are some meta-hyperparameters and settings that must be set. These are defined in INIT_HP, for general parameters; MUTATION_PARAMS, which defines the evolutionary probabilities; and NET_CONFIG, which defines the network architecture. For example:
INIT_HP = {
'ENV_NAME': 'LunarLander-v2',
'ALGO': 'DQN',
'DOUBLE': True,
'CHANNELS_LAST': False,
'BATCH_SIZE': 256,
'LR': 1e-3,
'MAX_STEPS': 1_000_000,
'TARGET_SCORE': 200.,
'GAMMA': 0.99,
'MEMORY_SIZE': 10000,
'LEARN_STEP': 1,
'TAU': 1e-3,
'TOURN_SIZE': 2,
'ELITISM': True,
'POP_SIZE': 6,
'EVO_STEPS': 10_000,
'EVAL_STEPS': None,
'EVAL_LOOP': 1,
'LEARNING_DELAY': 1000,
'WANDB': True,
}
MUTATION_PARAMS = {
'NO_MUT': 0.4,
'ARCH_MUT': 0.2,
'NEW_LAYER': 0.2,
'PARAMS_MUT': 0.2,
'ACT_MUT': 0,
'RL_HP_MUT': 0.2,
'MUT_SD': 0.1,
'RAND_SEED': 1,
}
NET_CONFIG = {
    'latent_dim': 16,
    'encoder_config': {
        'hidden_size': [32]
    },
    'head_config': {
        'hidden_size': [32]
    }
}
First, use utils.utils.create_population to create a list of agents - our population that will evolve and mutate to the optimal hyperparameters.
import torch
from agilerl.utils.utils import (
make_vect_envs,
create_population,
observation_space_channels_to_first
)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
num_envs = 16
env = make_vect_envs(env_name=INIT_HP['ENV_NAME'], num_envs=num_envs)
observation_space = env.single_observation_space
action_space = env.single_action_space
if INIT_HP['CHANNELS_LAST']:
observation_space = observation_space_channels_to_first(observation_space)
agent_pop = create_population(
algo=INIT_HP['ALGO'],
observation_space=observation_space,
action_space=action_space,
net_config=NET_CONFIG,
INIT_HP=INIT_HP,
population_size=INIT_HP['POP_SIZE'],
num_envs=num_envs,
device=device
)
Next, create the tournament, mutations and experience replay buffer objects that allow agents to share memory and efficiently perform evolutionary HPO.
from agilerl.components.replay_buffer import ReplayBuffer
from agilerl.hpo.tournament import TournamentSelection
from agilerl.hpo.mutation import Mutations
field_names = ["state", "action", "reward", "next_state", "done"]
memory = ReplayBuffer(
memory_size=INIT_HP['MEMORY_SIZE'],
field_names=field_names,
device=device,
)
tournament = TournamentSelection(
tournament_size=INIT_HP['TOURN_SIZE'],
elitism=INIT_HP['ELITISM'],
population_size=INIT_HP['POP_SIZE'],
eval_loop=INIT_HP['EVAL_LOOP'],
)
mutations = Mutations(
no_mutation=MUTATION_PARAMS['NO_MUT'],
architecture=MUTATION_PARAMS['ARCH_MUT'],
new_layer_prob=MUTATION_PARAMS['NEW_LAYER'],
parameters=MUTATION_PARAMS['PARAMS_MUT'],
activation=MUTATION_PARAMS['ACT_MUT'],
rl_hp=MUTATION_PARAMS['RL_HP_MUT'],
mutation_sd=MUTATION_PARAMS['MUT_SD'],
rand_seed=MUTATION_PARAMS['RAND_SEED'],
device=device,
)
The easiest training loop implementation is to use our train_off_policy() function. It requires the agent to have get_action() and learn() methods.
from agilerl.training.train_off_policy import train_off_policy
trained_pop, pop_fitnesses = train_off_policy(
env=env,
env_name=INIT_HP['ENV_NAME'],
algo=INIT_HP['ALGO'],
pop=agent_pop,
memory=memory,
swap_channels=INIT_HP['CHANNELS_LAST'],
max_steps=INIT_HP["MAX_STEPS"],
evo_steps=INIT_HP['EVO_STEPS'],
eval_steps=INIT_HP["EVAL_STEPS"],
eval_loop=INIT_HP["EVAL_LOOP"],
learning_delay=INIT_HP['LEARNING_DELAY'],
target=INIT_HP['TARGET_SCORE'],
tournament=tournament,
mutation=mutations,
wb=INIT_HP['WANDB'],
)
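Alternatively, the same components can be wired together in a custom training loop for more control. The sketch below is a minimal, illustrative outline only, assuming the agent methods get_action(), learn() and test(), the buffer methods save_to_memory() and sample(), and the tournament.select() / mutations.mutation() interfaces described in the AgileRL documentation; exact signatures may differ between versions.
import numpy as np

# Minimal sketch of a custom evolutionary training loop (illustrative only -
# see the AgileRL docs for the full train_off_policy() implementation).
eps_start, eps_end, eps_decay = 1.0, 0.1, 0.995
epsilon = eps_start
total_steps = 0

while total_steps < INIT_HP['MAX_STEPS']:
    for agent in agent_pop:  # Each member of the population collects experience
        state, info = env.reset()
        for _ in range(INIT_HP['EVO_STEPS'] // num_envs):
            action = agent.get_action(state, epsilon)  # Epsilon-greedy action
            next_state, reward, terminated, truncated, info = env.step(action)

            # Store the vectorised transition in the shared replay buffer
            memory.save_to_memory(
                state, action, reward, next_state, terminated, is_vectorised=True
            )

            # Learn once enough transitions have been collected
            if len(memory) >= INIT_HP['BATCH_SIZE'] and total_steps > INIT_HP['LEARNING_DELAY']:
                experiences = memory.sample(INIT_HP['BATCH_SIZE'])
                agent.learn(experiences)

            state = next_state
            total_steps += num_envs

        epsilon = max(eps_end, epsilon * eps_decay)  # Decay exploration

    # Evaluate each agent, then evolve the population
    fitnesses = [
        agent.test(
            env,
            swap_channels=INIT_HP['CHANNELS_LAST'],
            max_steps=INIT_HP['EVAL_STEPS'],
            loop=INIT_HP['EVAL_LOOP'],
        )
        for agent in agent_pop
    ]
    print(f"Mean population fitness: {np.mean(fitnesses):.2f}")
    elite, agent_pop = tournament.select(agent_pop)  # Selection uses the fitness recorded by test()
    agent_pop = mutations.mutation(agent_pop)  # Mutate architectures and RL hyperparameters
This mirrors what train_off_policy() handles for you (experience collection, learning, evaluation, selection and mutation), so the built-in function is usually the simpler choice.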
Citing AgileRL
If you use AgileRL in your work, please cite the repository:
@software{Ustaran-Anderegg_AgileRL,
author = {Ustaran-Anderegg, Nicholas and Pratt, Michael and Sabal-Bermudez, Jaime},
license = {Apache-2.0},
title = {{AgileRL}},
url = {https://github.com/AgileRL/AgileRL}
}