higher is a library providing support for higher-order optimization, e.g. through unrolled first-order optimization loops, of "meta" aspects of these loops. It provides tools for turning existing torch.nn.Module instances "stateless", meaning that changes to the parameters thereof can be tracked, and gradients with respect to intermediate parameters can be taken. It also provides a suite of differentiable optimizers, to facilitate the implementation of various meta-learning approaches.
Full documentation is available at https://higher.readthedocs.io/en/latest/.
Requirements and Installation
- Python version >= 3.5
- PyTorch version >= 1.3
To install higher from PyPI:
pip install higher
To install higher from source:
git clone git@github.com:facebookresearch/higher.git
cd higher
pip install .
Alternatively, python setup.py install will do the same thing.
Citation
If you use higher in your research and find it helpful, please consider citing the following paper:
@article{grefenstette2019generalized,
title={Generalized Inner Loop Meta-Learning},
author={Grefenstette, Edward and Amos, Brandon and Yarats, Denis and Htut, Phu Mon and Molchanov, Artem and Meier, Franziska and Kiela, Douwe and Cho, Kyunghyun and Chintala, Soumith},
journal={arXiv preprint arXiv:1910.01727},
year={2019}
}
Use case
Your needs
You have a model with parameters P, where P[t] denotes the parameters at update timestep t.
You want to update the model through k steps of optimization, and compute gradients through the optimization process, i.e. compute torch.autograd.grad(P[k], P[0]) or obtain gradients that depend on this gradient pathway existing.
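As a minimal illustration of the quantity described above, the following sketch unrolls k steps of plain gradient descent by hand in PyTorch and differentiates P[k] with respect to P[0]. It does not use higher; the function name unroll_sgd, the quadratic loss, and the learning rate are assumptions chosen so that the result can be checked analytically.

```python
import torch


def unroll_sgd(p0, lr, k):
    """Hypothetical helper: k differentiable SGD steps on loss = 0.5 * ||p||^2."""
    p = p0
    for _ in range(k):
        loss = 0.5 * (p ** 2).sum()
        # create_graph=True keeps the gradient itself differentiable
        (g,) = torch.autograd.grad(loss, p, create_graph=True)
        # Out-of-place update: P[t+1] is a new node in the graph
        p = p - lr * g
    return p


p0 = torch.tensor([2.0], requires_grad=True)
pk = unroll_sgd(p0, lr=0.1, k=3)

# Gradient of P[k] with respect to P[0], taken through all three updates
(dpk_dp0,) = torch.autograd.grad(pk.sum(), p0)
```

For this loss each step multiplies the parameters by (1 - lr), so the gradient of P[k] with respect to P[0] is (1 - lr)^k, which is 0.729 for lr = 0.1 and k = 3.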
Your obstacles
You are using some existing code for your model, so the parameters are stateful, preventing you from forming a graph with P[t] as nodes.
Even if you roll your own solution, you want to use optimization techniques beyond normal SGD, and torch.optim optimizers don't let you optimize "through" them.
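Both obstacles can be observed in plain PyTorch. In the sketch below (a toy torch.nn.Linear model; the sizes and learning rate are arbitrary choices for illustration), torch.optim.SGD overwrites the weight in place, so after the step there is no separate P[1] node and no record of how it was computed from P[0].

```python
import torch

model = torch.nn.Linear(2, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.1)

before = model.weight  # reference to the parameter object before the update
loss = model(torch.ones(1, 2)).sum()
loss.backward()
opt.step()
after = model.weight

# The update happened in place: it is still the very same tensor object ...
same_object = after is before
# ... and it carries no autograd history linking it to its previous value.
has_history = after.grad_fn is not None
```

Here same_object is True and has_history is False, which is why the update cannot be differentiated through.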
Your solution
Good news: higher has got you covered! Using our growing set of tools and utility functions, you can backpropagate through an unbounded number of model update steps for all your meta-learning needs.
This library includes:
- Helper functions for monkey-patching torch.nn modules to make them functional (non-stateful), i.e. feed their parameters as an extra argument during the forward pass.
- Classes implementing differentiable versions of torch.optim.Adam (and SGD), designed to track or branch out from the state of a "normal" Adam instance.
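To make the idea of a functional forward pass concrete, here is a hand-written sketch for a single torch.nn.Linear layer. The helper functional_linear is hypothetical and is not part of higher, which automates this for arbitrary modules; the sketch only shows what feeding parameters as an extra argument means.

```python
import torch
import torch.nn.functional as F


def functional_linear(module, xs, params=None):
    """Hypothetical helper: run a Linear layer with externally supplied parameters."""
    if params is None:
        # Fall back to the parameters stored on the module
        params = list(module.parameters())
    weight, bias = params
    return F.linear(xs, weight, bias)


module = torch.nn.Linear(3, 1)
xs = torch.ones(2, 3)

# "Fast weights": any tensors of the right shapes, here derived from the originals
fast_weights = [p * 2.0 for p in module.parameters()]

out_default = functional_linear(module, xs)
out_fast = functional_linear(module, xs, params=fast_weights)
```

Because the fast weights are ordinary tensors computed from the module's parameters, gradients of out_fast flow back to those parameters, which is the property an unrolled optimization loop relies on.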
Example Usage
Say your training code looks like this:
model = MyModel()
opt = torch.optim.Adam(model.parameters())
for xs, ys in data:
    opt.zero_grad()
    logits = model(xs)
    loss = loss_function(logits, ys)
    loss.backward()
    opt.step()
To turn this into a differentiable version, the following changes should be introduced:
model = MyModel()
opt = torch.optim.Adam(model.parameters())
with higher.innerloop_ctx(model, opt) as (fmodel, diffopt):
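    for xs, ys in data:
        logits = fmodel(xs)
        loss = loss_function(logits, ys)
        # step takes the loss directly; there is no call to loss.backward()
        diffopt.step(loss)
    # Still inside the context: differentiate through the whole unrolled loop
    grad_of_grads = torch.autograd.grad(
        meta_loss_fn(fmodel.parameters()), fmodel.parameters(time=0))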
Beware that when unrolling your optimization like this for k steps, all gradients and all activations of your model at each step are kept in memory, meaning the memory footprint of your model is k times greater.
Adding your own optimizers
It is possible to use optimizers other than those found in torch.optim. A differentiable version must be implemented first. This can be done by subclassing higher.optim.DifferentiableOptimizer and overriding the _update method, following the arguments of the original. Assuming the logic of the optimizer being added follows the logic of those found in torch.optim, the steps to follow are more or less:
- Remove the following code (no support for closures).
loss = None
if closure is not None:
    loss = closure()
- Replace
for group in self.param_groups:
    for p in group['params']:
        if p.grad is None:
            continue
        grad = p.grad.data
with
zipped = zip(self.param_groups, grouped_grads)
for group_idx, (group, grads) in enumerate(zipped):
    for p_idx, (p, g) in enumerate(zip(group['params'], grads)):
        if g is None:
            continue
- Replace state = self.state[p] with state = self.state[group_idx][p_idx].
- Replace any in-place op with a non in-place op, e.g. t.add_(a, x).mul_(y) should become t = t.add(a, x).mul(y) (note the assignment). Be careful to also track where dictionaries are being implicitly updated by such ops, e.g. if there is code of the form:
p = state['k']
...
p.add_(a, x)
in the original optimizer, this code should be converted to
p = state['k']
...
state['k'] = p = p.add(a, x)
to ensure the corresponding dictionary entry is updated as well.
- Except where used for shape inference, replace instances of t.data with t for all t.
- Be sure to update group['params'][p_idx] for each p_idx in need of update (those ignored will yield the original parameters in the fast weight collection). The latest fast weights will be returned by the inherited step function.
- Importantly, you need to register your new differentiable optimizer with higher using higher.register_optim to ensure that it is recognized as an option by the library's methods. You can do this at any point after the definition of an optimizer, and before any higher code involving that optimizer is called. For example, if you have implemented MyDiffOpt as a differentiable version of some optimizer MyOpt, register it by adding the line higher.register_optim(MyOpt, MyDiffOpt) after the classes are defined.
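The conversion from in-place to out-of-place state updates described in the steps above can be sketched in plain PyTorch, independently of higher's base class. The function momentum_step and its state dictionary are hypothetical names used only for illustration, and the sketch uses the keyword form t.add(x, alpha=a), which is how current PyTorch releases spell the scaled addition.

```python
import torch


def momentum_step(p, g, state, lr=0.1, momentum=0.9):
    """Hypothetical out-of-place SGD-with-momentum update."""
    buf = state.get("momentum_buffer")
    # Out-of-place ops only, so the previous buffer stays intact in the graph
    buf = g if buf is None else buf.mul(momentum).add(g)
    # Reassign explicitly: the dictionary must point at the new tensor
    state["momentum_buffer"] = buf
    # Return a new parameter tensor instead of mutating p
    return p.add(buf, alpha=-lr)


p0 = torch.tensor(1.0, requires_grad=True)
p, state = p0, {}
for _ in range(2):
    loss = 0.5 * p ** 2
    (g,) = torch.autograd.grad(loss, p, create_graph=True)
    p = momentum_step(p, g, state)

# Derivative of the twice-updated parameter with respect to the initial one
(dp2_dp0,) = torch.autograd.grad(p, p0)
```

With this loss the gradient equals the parameter, so the first step gives 0.9 * p0, the buffer becomes 1.8 * p0 on the second step, and the final parameter is 0.72 * p0. The derivative 0.72 is only recoverable because neither the parameter nor the buffer was modified in place.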
You can find examples of how to test for gradient correctness using finite difference methods in tests/test_optim.py. Please note that some stability tricks may be needed to avoid NaNs in the gradients. See the higher.optim.DifferentiableAdam implementation for examples of mitigation strategies: identify operations that yield exploding gradients, typically those taking the square roots of moving averages (which are initially zero), and register a backward hook using x.register_hook on the inputs x to those functions, using the helper function _get_mask_closure from higher.optim.
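The masking idea can be sketched in plain PyTorch. The code below illustrates the technique only and does not reproduce higher's _get_mask_closure; the helper mask_zero_grads is a hypothetical name, and the tensor avg stands in for a moving average that is still zero in one entry.

```python
import torch


def mask_zero_grads(t):
    """Hypothetical helper: zero the incoming gradient wherever t is zero."""
    mask = t.detach() == 0
    t.register_hook(lambda grad: torch.where(mask, torch.zeros_like(grad), grad))
    return t


x = torch.tensor([0.0, 4.0], requires_grad=True)

# Without masking, the derivative of sqrt at zero is infinite
(unmasked,) = torch.autograd.grad((x * 1.0).sqrt().sum(), x)

# With the hook registered on the input of sqrt, that entry is zeroed out
avg = mask_zero_grads(x * 1.0)
avg.sqrt().sum().backward()
masked = x.grad
```

The unmasked gradient contains inf in its first entry, whereas the masked gradient is [0.0, 0.25]. An infinite entry like the first one turns into NaN as soon as it is multiplied by zero further down the backward pass.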
Release Notes
See the changelog for release notes.
Known/Possible Issues
- See the issues tracker for an up-to-date list.
- No support (or planned support) for torch.nn.DataParallel at this time. This would require a rewrite of DataParallel. Please raise an issue on the PyTorch issue tracker if this matters to you.
- Some of the adaptive gradient-style differentiable optimizers may be unstable and yield NaNs when taking higher order gradients. Some tricks have been used to mitigate this risk. Please raise an issue if these are not sufficient in practice.
- Second-order gradients may not work with some CUDNN modules (mostly RNNs). From PyTorch v1.3 onwards, wrapping the code where models are used with higher using the following context manager should solve the issue:
with torch.backends.cudnn.flags(enabled=False):
    pass  # replace with your meta-learning code
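As a rough sketch of how the context manager is used, the following example takes a second-order gradient through a small torch.nn.RNN inside the cudnn.flags block. It uses plain PyTorch rather than higher, and the layer sizes, input shape, and toy loss are arbitrary assumptions; on a machine without a GPU the flag changes nothing, but the code still runs.

```python
import torch

torch.manual_seed(0)
rnn = torch.nn.RNN(input_size=3, hidden_size=4, batch_first=True)
xs = torch.randn(2, 5, 3)
params = list(rnn.parameters())

with torch.backends.cudnn.flags(enabled=False):
    out, _ = rnn(xs)
    loss = out.pow(2).mean()
    # First-order gradients, kept differentiable
    first = torch.autograd.grad(loss, params, create_graph=True)
    # A scalar that depends on those gradients
    grad_norm = sum(g.pow(2).sum() for g in first)
    # Second-order gradients: differentiate through the first backward pass
    second = torch.autograd.grad(grad_norm, params)
```

Both the forward pass and the gradient computations sit inside the block, so every RNN operation that ends up in the graph is created with CUDNN disabled.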
License
higher is released under Apache License Version 2.0.
Thanks
Thanks to Adam Paszke whose gist was the source of inspiration (and starting point) for our method for monkey-patching arbitrary torch.nn modules.
Thanks to the many interns, researchers, and engineers who helped road-test early versions of this library.