Model Learning for Multistep Backward Prediction in Dyna-Q Learning

Authors: Hwang, Kao Shing*; Jiang, Wei Cheng; Chen, Yu Jen; Hwang, Iris
Source: IEEE Transactions on Systems, Man, and Cybernetics: Systems, 2018, 48(9): 1470-1481.
DOI:10.1109/TSMC.2017.2671848

Abstract

A model-based reinforcement learning (RL) method that interplays direct and indirect learning to update Q functions is proposed. The environment is approximated by a virtual model that predicts the transition to the next state and the reward of the domain. This virtual model is used to train the Q functions and thereby accelerate policy learning. Lookup-table methods are usually used to establish such environmental models, but they require collecting tremendous amounts of experience to enumerate the responses of the environment. In this paper, a stochastic model learning method based on tree structures is presented. To model the transition probability, an online clustering method is applied so that the model learning method can estimate transition probabilities. With the virtual model, the RL method produces simulated experience in the stage of indirect learning. Since simulated transitions and backups are more usefully focused by working backward from state-action pairs whose estimated Q values have changed significantly, the useful one-step backups are those of actions that lead directly into a state whose value has already changed markedly. This, however, may induce false positives; that is, a backup state may be an invalid state, such as an absorbing or terminal state, especially because the changes of Q values at the planning stage must still be fed back for ranking even though they are based on simulated experience and are possibly erroneous. Clearly, when the agent is drawn to generate simulated experience around these absorbing states, learning efficiency deteriorates. This paper proposes three detection methods to solve this problem, which further speeds up policy learning. The effectiveness and generality of our method are demonstrated in three numerical simulations. The simulation results show that the training rate of our method is markedly improved.
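For orientation, the sketch below (Python) illustrates the kind of Dyna-Q planning loop the abstract describes, in which simulated backups are focused by working backward from state-action pairs whose Q values have changed significantly, using a priority queue in the style of prioritized sweeping. It is not the authors' algorithm: it assumes a deterministic tabular model and omits the tree-structured stochastic model, the online clustering of transition probabilities, and the absorbing-state detection methods that the paper actually proposes; all class names and parameter values are illustrative assumptions.

    # Minimal sketch: tabular Dyna-Q with backward-focused (prioritized) planning.
    # Not the paper's method; deterministic model, no absorbing-state detection.
    import heapq
    import random
    from collections import defaultdict

    class PrioritizedDynaQ:
        def __init__(self, actions, alpha=0.1, gamma=0.95, theta=1e-4, n_planning=10):
            self.actions = list(actions)
            self.alpha, self.gamma = alpha, gamma
            self.theta = theta                    # priority threshold for queueing
            self.n_planning = n_planning          # simulated backups per real step
            self.Q = defaultdict(float)           # Q[(state, action)]
            self.model = {}                       # (s, a) -> (r, s_next), learned virtual model
            self.predecessors = defaultdict(set)  # s_next -> {(s, a), ...} known to lead there
            self.pq = []                          # max-heap via negated priorities
            self._tick = 0                        # tie-breaker so heap never compares states

        def choose_action(self, s, epsilon=0.1):
            # epsilon-greedy over current Q estimates
            if random.random() < epsilon:
                return random.choice(self.actions)
            return max(self.actions, key=lambda a: self.Q[(s, a)])

        def _priority(self, s, a, r, s_next):
            # magnitude of the one-step Bellman error for (s, a)
            target = r + self.gamma * max(self.Q[(s_next, b)] for b in self.actions)
            return abs(target - self.Q[(s, a)])

        def _push(self, s, a, p):
            self._tick += 1
            heapq.heappush(self.pq, (-p, self._tick, s, a))

        def step(self, s, a, r, s_next):
            # Record the real transition in the virtual (here: deterministic) model.
            self.model[(s, a)] = (r, s_next)
            self.predecessors[s_next].add((s, a))
            # Queue the experienced pair if its value would change significantly.
            p = self._priority(s, a, r, s_next)
            if p > self.theta:
                self._push(s, a, p)
            # Indirect learning: sweep backward from high-priority pairs.
            for _ in range(self.n_planning):
                if not self.pq:
                    break
                _, _, s_, a_ = heapq.heappop(self.pq)
                r_, sn_ = self.model[(s_, a_)]
                target = r_ + self.gamma * max(self.Q[(sn_, b)] for b in self.actions)
                self.Q[(s_, a_)] += self.alpha * (target - self.Q[(s_, a_)])
                # Propagate to predecessors of s_, whose backups may now be useful.
                for (sp, ap) in self.predecessors[s_]:
                    rp, _ = self.model[(sp, ap)]
                    pp = self._priority(sp, ap, rp, s_)
                    if pp > self.theta:
                        self._push(sp, ap, pp)

In this sketch every backward backup trusts the recorded model; the false-positive problem described in the abstract arises precisely because such simulated, possibly erroneous value changes are put back into the ranking, which is what the paper's three detection methods are designed to address.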