A Strong On-Policy Competitor To PPO

Xiangxiang Chu

2021

As a recognized variant of and improvement on Trust Region Policy Optimization (TRPO), Proximal Policy Optimization (PPO) has been widely used thanks to several advantages: efficient data utilization, easy implementation, and good parallelism. This paper proposes another powerful variant, a first-order on-policy learning algorithm called Policy Optimization with Penalized Point Probability Distance (POP3D). Its penalty term, the point probability distance, is a lower bound on the square of the total variation divergence. The penalty has a dual effect: it prevents policy updates from overshooting and it encourages more exploration. Carefully controlled experiments on both discrete and continuous benchmarks verify that the approach is highly competitive with PPO.
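The lower-bound claim can be checked numerically on small discrete policies. The abstract does not state the formulas, so the following is a minimal sketch under stated assumptions: the point probability distance is taken to be the squared difference between the probabilities that the old and new policies assign to the sampled action, the total variation divergence is taken as half the L1 distance, and the penalized objective is assumed to be the usual probability-ratio surrogate minus a weighted penalty. The function names and the coefficient `beta` are hypothetical and are not taken from the paper.

```python
import random


def total_variation(p, q):
    """Total variation divergence between two discrete distributions (half the L1 distance)."""
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))


def point_probability_distance(p_old, p_new, action):
    """Assumed form: squared gap between the probabilities of the sampled action."""
    return (p_old[action] - p_new[action]) ** 2


def penalized_objective(p_old, p_new, action, advantage, beta):
    """Assumed objective (to be maximized): ratio surrogate minus beta times the penalty."""
    ratio = p_new[action] / p_old[action]
    penalty = point_probability_distance(p_old, p_new, action)
    return ratio * advantage - beta * penalty


def random_distribution(n, rng):
    """Draw a random probability vector with n strictly positive entries."""
    weights = [rng.random() + 1e-6 for _ in range(n)]
    total = sum(weights)
    return [w / total for w in weights]


def check_lower_bound(trials=1000, n_actions=4, seed=0):
    """Return True if the penalty never exceeds the squared total variation in any trial."""
    rng = random.Random(seed)
    for _ in range(trials):
        p = random_distribution(n_actions, rng)
        q = random_distribution(n_actions, rng)
        tv_squared = total_variation(p, q) ** 2
        for action in range(n_actions):
            # Small tolerance absorbs floating-point rounding.
            if point_probability_distance(p, q, action) > tv_squared + 1e-12:
                return False
    return True
```

Under these assumptions the bound holds for every action, because the probability differences sum to zero, so the gap on any single action cannot exceed half the total L1 distance. When the new policy equals the old one, the penalty vanishes and the objective reduces to the advantage itself.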

Keywords:

Trust region
Mathematical optimization
Square (algebra)
Divergence (statistics)
Total variation
Upper and lower bounds
Dual (category theory)
Point (geometry)
Computer science
Parallelism
