Abstract

Temporal difference (TD) methods are a class of algorithms for learning predictions in multistep prediction problems; their most important application is temporal credit assignment in reinforcement learning. Although TD procedures are sound in theory and in principle, their success hinges on the proper selection of parameter values. Moreover, their learning relies heavily on repeated exposure to the same experience, which is not always practical or feasible. This paper addresses the efficient and general implementation of TD for hardware realizations of reinforcement learning algorithms by synthesizing the time series of discounted sums of rewards. The proposed algorithm eliminates all step-size parameters and improves data efficiency through a synthesis approach based on Grey theory; the stability of the proposed algorithm is also analyzed from the viewpoint of Grey theory. The algorithm, together with a critic-actor reinforcement learning model, is implemented on a System-on-a-Programmable-Chip (SOPC) board. Experimental results, including a comparison with the well-known adaptive heuristic critic (AHC) model, demonstrate that the proposed control mechanism can learn to control a system with very little a priori knowledge. Moreover, the effect of uncertainty in the interactions between the system and the environment is mitigated to some extent during the learning process of the proposed reinforcement learning agent.
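
For context, the sketch below illustrates the conventional baseline the paper improves upon: tabular TD(0) estimating the discounted sum of rewards on a toy Markov chain. The hand-tuned step size `alpha`, the discount factor, and the chain itself are illustrative assumptions; this is standard TD prediction, not the proposed step-size-free Grey-theory algorithm.

```python
import numpy as np

# Minimal sketch (assumed example, not the paper's method): tabular TD(0)
# learning the discounted sum of rewards on a small right-moving chain.
# The step size `alpha` is exactly the kind of parameter the proposed
# Grey-theory algorithm claims to eliminate.

n_states = 5         # states 0..4, state 4 is terminal
gamma = 0.9          # discount factor (illustrative)
alpha = 0.1          # hand-tuned step size used by conventional TD
V = np.zeros(n_states)

for episode in range(500):
    s = 0
    while s < n_states - 1:
        s_next = s + 1                              # deterministic transition
        r = 1.0 if s_next == n_states - 1 else 0.0  # reward on reaching the goal
        # TD(0) update toward the one-step bootstrapped target;
        # the bootstrap term is zero at the terminal state.
        target = r + gamma * V[s_next] * (s_next < n_states - 1)
        V[s] += alpha * (target - V[s])
        s = s_next

print(V)  # V[s] approaches gamma**(n_states - 2 - s)
```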

  • Publication date: 2011-12