Recurrent prediction model for partially observable MDPs

2023 
The partially observable Markov decision process (POMDP) is a key challenge in applying reinforcement learning, since it describes real agent-environment interactions more comprehensively than the fully observed MDP. Recent works mainly rely on conventional reward signals to train a representation that converts a POMDP into an MDP. However, rewards alone are insufficient for a good representation without temporal information. In this paper, we first introduce a novel Recurrent Prediction Model that integrates temporal information into the representation and thereby addresses POMDP problems, by training three additional unsupervised prediction models: a transition model, a reward recovery model, and an observation recovery model. Second, we modify the data structure of the vanilla replay buffer to reduce memory usage, and third, we propose an off-policy correction algorithm that decreases policy lag in POMDPs. Experiments show that our model achieves better performance in partially observable environments on both stand-alone and distributed training systems.
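To make the three auxiliary objectives concrete, the following is a minimal sketch of how such a recurrent representation with transition, reward-recovery, and observation-recovery heads could look. It assumes a GRU encoder over observation-action histories and MSE losses; all names, dimensions, and loss choices are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch (not the paper's code): a recurrent encoder with three
# unsupervised prediction heads, as described in the abstract.
import torch
import torch.nn as nn


class RecurrentPredictionModel(nn.Module):
    def __init__(self, obs_dim: int, act_dim: int, hidden_dim: int = 128):
        super().__init__()
        # Recurrent encoder: folds the observation-action history into a
        # belief state that stands in for the unobserved Markov state.
        self.encoder = nn.GRU(obs_dim + act_dim, hidden_dim, batch_first=True)
        # Three auxiliary heads trained alongside the RL objective.
        self.transition_head = nn.Linear(hidden_dim + act_dim, hidden_dim)  # predicts next belief
        self.reward_head = nn.Linear(hidden_dim + act_dim, 1)               # recovers the reward
        self.obs_head = nn.Linear(hidden_dim, obs_dim)                      # recovers the observation

    def forward(self, obs_seq, act_seq):
        # obs_seq: (batch, T, obs_dim); act_seq: (batch, T, act_dim)
        beliefs, _ = self.encoder(torch.cat([obs_seq, act_seq], dim=-1))
        return beliefs  # (batch, T, hidden_dim)

    def auxiliary_losses(self, obs_seq, act_seq, rew_seq):
        beliefs = self.forward(obs_seq, act_seq)
        # Pair each belief b_t with the following action a_{t+1}.
        inp = torch.cat([beliefs[:, :-1], act_seq[:, 1:]], dim=-1)
        pred_next_belief = self.transition_head(inp)
        pred_reward = self.reward_head(inp).squeeze(-1)
        pred_obs = self.obs_head(beliefs)
        # Unsupervised targets: next belief (stop-gradient), reward, observation.
        transition_loss = nn.functional.mse_loss(pred_next_belief, beliefs[:, 1:].detach())
        reward_loss = nn.functional.mse_loss(pred_reward, rew_seq[:, 1:])
        obs_loss = nn.functional.mse_loss(pred_obs, obs_seq)
        return transition_loss + reward_loss + obs_loss
```

In such a setup, the combined auxiliary loss would typically be added to the policy or value loss during training, so the belief state is shaped by temporal prediction signals rather than by rewards alone.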