A Survey on Constraining Policy Updates Using the KL Divergence

2021 
Model-free reinforcement learning methods have proven successful at learning complex tasks. Optimizing a policy directly from observations sampled from an environment avoids the accumulation of model errors that model-based methods suffer from. However, model-free methods are less sample-efficient than their model-based counterparts and may yield unstable policy updates when the step size between successive policy updates is too large. This survey analyzes and compares three state-of-the-art model-free policy search algorithms that address the latter issue of unstable policy updates: relative entropy policy search (REPS), trust region policy optimization (TRPO), and proximal policy optimization (PPO). All three algorithms constrain the policy update using the Kullback-Leibler (KL) divergence. After an introduction to model-free policy search methods, the survey illustrates the importance of KL regularization for policy improvement. It then introduces and describes the KL-regularized reinforcement learning problem. REPS, TRPO, and PPO are derived from a single set of equations, and their differences are detailed. The survey concludes with a discussion of the algorithms’ weaknesses, pointing out directions for future work.