Adam Layer-wise LR Decay
ELECTRA, published by Stanford University and Google Brain, uses a layer-wise LR decay technique with the Adam optimizer to prevent catastrophic forgetting of the pre-trained model.
This repo contains an implementation of layer-wise LR decay for Adam, built on the new Optimizer API introduced in TensorFlow 2.11.
Usage
Installation:
$ pip install adam-lr-decay
For CPU:
$ pip install adam-lr-decay[cpu]
For GPU:
$ pip install adam-lr-decay[gpu]
from tensorflow.keras import layers, models
from adam_lr_decay import AdamLRDecay

model = models.Sequential([
    layers.Dense(3, input_shape=(2,), name='hidden_dense'),
    layers.Dense(1, name='output')
])

adam = AdamLRDecay(learning_rate=1e-3)
adam.apply_layerwise_lr_decay(var_name_dicts={
    'hidden_dense': 0.1,
    'output': 0.
})
model.compile(optimizer=adam)
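The keys in var_name_dicts refer to variable names. The sketch below illustrates one plausible way such keys could select variables, assuming substring matching where the first matching key wins, as in the official ELECTRA optimizer code. This matching rule is an assumption and has not been checked against this package's source. The helper match_decay_rate and the variable names listed are hypothetical; the names only imitate what Keras typically generates for the two layers above.

```python
def match_decay_rate(var_name, var_name_dicts, default=0.):
    """Return the decay rate of the first key contained in var_name.

    Assumption: keys are matched as substrings of the variable name,
    and variables with no matching key get no decay (default 0.0).
    """
    for key, rate in var_name_dicts.items():
        if key in var_name:
            return rate
    return default


var_name_dicts = {'hidden_dense': 0.1, 'output': 0.}

# Hypothetical variable names for the two Dense layers above.
names = [
    'hidden_dense/kernel:0',
    'hidden_dense/bias:0',
    'output/kernel:0',
    'output/bias:0',
]

matched = {name: match_decay_rate(name, var_name_dicts) for name in names}
for name, rate in matched.items():
    print(name, rate)
```

Under this assumption, both variables of hidden_dense receive a decay rate of 0.1 and both variables of output receive 0.0.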
The official ELECTRA repo defines the decay rates in code. An adapted version is as follows:
import collections
from adam_lr_decay import AdamLRDecay

def _get_layer_lrs(layer_decay, n_layers):
    # Map each variable-name key to its depth in the network.
    key_to_depths = collections.OrderedDict({
        '/embeddings/': 0,
        '/embeddings_project/': 0,
        'task_specific/': n_layers + 2,
    })
    for layer in range(n_layers):
        key_to_depths['encoder/layer_' + str(layer) + '/'] = layer + 1
    # Layers further from the output get a larger decay rate.
    return {
        key: 1. - (layer_decay ** (n_layers + 2 - depth))
        for key, depth in key_to_depths.items()
    }

adam = AdamLRDecay(learning_rate=1e-3)
adam.apply_layerwise_lr_decay(var_name_dicts=_get_layer_lrs(0.9, 8))
The generated decay rates should look like this. 0.0 means no decay, and 1.0 means a learning rate of zero (non-trainable).
{
    "/embeddings/": 0.6513215599,
    "/embeddings_project/": 0.6513215599,
    "task_specific/": 0.0,
    "encoder/layer_0/": 0.6125795109999999,
    "encoder/layer_1/": 0.5695327899999999,
    "encoder/layer_2/": 0.5217030999999999,
    "encoder/layer_3/": 0.46855899999999995,
    "encoder/layer_4/": 0.40950999999999993,
    "encoder/layer_5/": 0.3439,
    "encoder/layer_6/": 0.2709999999999999,
    "encoder/layer_7/": 0.18999999999999995
}
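To make the convention concrete, the sketch below turns decay rates into effective per-layer learning rates. It assumes the optimizer scales the base learning rate by (1 - decay rate). That linear scaling is inferred from the two endpoints described above (0.0 keeps the full rate, 1.0 gives zero) and from the fact that it recovers ELECTRA's original layer_decay ** (n_layers + 2 - depth) multiplier; it is not confirmed against this package's source. The names layer_decay_rates and effective_lrs are hypothetical, and the first function only repeats the formula from the adapted code so the block runs on its own.

```python
import collections


def layer_decay_rates(layer_decay, n_layers):
    """Same formula as the adapted ELECTRA code, without the optimizer."""
    key_to_depths = collections.OrderedDict({
        '/embeddings/': 0,
        '/embeddings_project/': 0,
        'task_specific/': n_layers + 2,
    })
    for layer in range(n_layers):
        key_to_depths['encoder/layer_' + str(layer) + '/'] = layer + 1
    return {
        key: 1. - (layer_decay ** (n_layers + 2 - depth))
        for key, depth in key_to_depths.items()
    }


def effective_lrs(base_lr, decay_rates):
    # Assumption: effective LR = base LR * (1 - decay rate).
    return {key: base_lr * (1. - rate) for key, rate in decay_rates.items()}


rates = layer_decay_rates(0.9, 8)
lrs = effective_lrs(1e-3, rates)
for key, lr in lrs.items():
    print(f'{key:<24} {lr:.3e}')
```

Under this assumption, the task-specific head trains at the full base rate of 1e-3, the top encoder layer at about 8.1e-4, and the embeddings at about 3.5e-4.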
Citation
@article{clark2020electra,
title={Electra: Pre-training text encoders as discriminators rather than generators},
author={Clark, Kevin and Luong, Minh-Thang and Le, Quoc V and Manning, Christopher D},
journal={arXiv preprint arXiv:2003.10555},
year={2020}
}