Blackwell Online Learning for Markov Decision Processes

Tao Li, Guanze Peng, Quanyan Zhu

2021

Abstract: This work provides a novel interpretation of Markov decision processes (MDPs) from an online optimization viewpoint. In this context, the policy of the MDP is viewed as the decision variable, while the corresponding value function is treated as payoff feedback from the environment. Based on this interpretation, we construct a Blackwell game induced by the MDP, which bridges the gap among regret minimization, Blackwell approachability theory, and learning theory for MDPs. Specifically, based on approachability theory, we propose 1) Blackwell value iteration for offline planning and 2) Blackwell Q-learning for online learning in MDPs, both of which are shown to converge to the optimal solution. Our theoretical guarantees are corroborated by numerical experiments.
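
To make the "policy as decision variable, value function as payoff feedback" reading concrete, the following is a minimal sketch of exact policy evaluation on a toy MDP. It is not the Blackwell value iteration or Blackwell Q-learning proposed in the paper; the two-state MDP, the discount factor, and the names `P`, `R`, `GAMMA`, and `evaluate_policy` are all assumptions introduced purely for illustration.

```python
import numpy as np

# Hypothetical toy MDP (illustration only, not from the paper):
# 2 states, 2 actions. Action 0 always moves to state 0, action 1 to state 1.
# P[s, a, t] = probability of moving from state s to state t under action a.
P = np.array([
    [[1.0, 0.0], [0.0, 1.0]],  # transitions from state 0
    [[1.0, 0.0], [0.0, 1.0]],  # transitions from state 1
])
# R[s, a] = immediate reward for taking action a in state s.
R = np.array([
    [0.0, 1.0],
    [0.0, 2.0],
])
GAMMA = 0.9  # assumed discount factor


def evaluate_policy(P, R, policy, gamma):
    """Return the value function of a stochastic policy.

    policy[s, a] is the probability of choosing action a in state s.
    Solves the linear Bellman equation (I - gamma * P_pi) v = r_pi.
    """
    n_states = R.shape[0]
    # State-to-state transition matrix induced by the policy.
    P_pi = np.einsum("sa,sat->st", policy, P)
    # Expected one-step reward in each state under the policy.
    r_pi = np.sum(policy * R, axis=1)
    return np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)


# Two candidate "decisions" (policies) and the "payoff feedback" they receive.
always_action_0 = np.array([[1.0, 0.0], [1.0, 0.0]])
always_action_1 = np.array([[0.0, 1.0], [0.0, 1.0]])

v0 = evaluate_policy(P, R, always_action_0, GAMMA)
v1 = evaluate_policy(P, R, always_action_1, GAMMA)
print("value of always_action_0:", v0)
print("value of always_action_1:", v1)
```

In this sketch the learner's choice is the whole matrix `policy`, and the vector returned by `evaluate_policy` plays the role of the feedback that an online method would use to revise that choice.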

Keywords:

Bellman equation
Markov decision process
Markov process
context
Approachability
Mathematical optimization
Dynamic programming
Multi-agent system
Stochastic game
Computer science
