CrossNorm: Normalization for Off-Policy TD Reinforcement Learning

2019 
Off-policy Temporal Difference (TD) learning methods, when combined with function approximators, risk divergence, a phenomenon known as the deadly triad. It has long been noted that some feature representations work better than others. In this paper we investigate how feature normalization can prevent divergence and improve training. Our method, which we call CrossNorm, can be regarded as a new variant of batch normalization that re-centers data for the multi-modal distributions that arise in off-policy TD updates. We show empirically that CrossNorm improves the stability of the learning process. We apply CrossNorm to DDPG and TD3 and achieve stable training and improved performance across a range of MuJoCo benchmark tasks. Moreover, for the first time, we are able to train DDPG stably without the use of target networks.
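
The abstract does not spell out the implementation, but the core idea it describes is normalizing over the combined distribution of the two input modes that appear in an off-policy TD update, rather than over a single batch. Below is a minimal PyTorch-style sketch of that idea; the module name CrossNormSketch, the choice of computing statistics over the concatenation of the current and next state-action feature batches, and the affine parameters are illustrative assumptions, not the authors' reference implementation.

```python
import torch
import torch.nn as nn


class CrossNormSketch(nn.Module):
    """Illustrative normalization layer (an assumption, not the paper's reference code).

    Instead of normalizing the current-step batch alone, statistics are computed
    over the concatenation of the two modes of an off-policy TD update:
    features for (s, a) and features for (s', a').
    """

    def __init__(self, num_features: int, eps: float = 1e-5):
        super().__init__()
        self.eps = eps
        # Learnable affine parameters, as in standard batch normalization.
        self.gamma = nn.Parameter(torch.ones(num_features))
        self.beta = nn.Parameter(torch.zeros(num_features))

    def forward(self, x_current: torch.Tensor, x_next: torch.Tensor):
        # Pool both modes before computing statistics, so the layer
        # re-centers the joint (multi-modal) distribution.
        pooled = torch.cat([x_current, x_next], dim=0)
        mean = pooled.mean(dim=0, keepdim=True)
        var = pooled.var(dim=0, unbiased=False, keepdim=True)

        def normalize(x):
            return self.gamma * (x - mean) / torch.sqrt(var + self.eps) + self.beta

        return normalize(x_current), normalize(x_next)


if __name__ == "__main__":
    layer = CrossNormSketch(num_features=8)
    cur = torch.randn(64, 8)   # features of (s, a) sampled from the replay buffer
    nxt = torch.randn(64, 8)   # features of (s', a') used to form the TD target
    y_cur, y_nxt = layer(cur, nxt)
    print(y_cur.shape, y_nxt.shape)
```

The point of pooling both batches is that the current and bootstrapped next inputs follow different distributions under off-policy training; normalizing them with shared statistics keeps the critic's inputs on a common scale, which is the stabilizing effect the abstract attributes to CrossNorm.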