Abstract

In reinforcement learning problems with large-scale or continuous state and action spaces, approximate reinforcement learning methods fit the policy with function approximators. Least-squares approximation extracts more useful information from the samples and can be applied effectively in online algorithms. Because of the complexity of many reinforcement learning problems, samples often cannot be generated from the target policy itself to evaluate that policy, so off-policy methods have to be used. Eligibility traces can usually accelerate the convergence of an algorithm. This paper proposes off-policy least-squares algorithms with eligibility traces based on importance reweighting: OFP-LSPE-Q and OFP-LSTD-Q. The derivation indicates that, in the off-policy setting, the proposed algorithms converge faster as the sample size increases, compared with traditional least-squares reinforcement learning methods.
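To make the idea of importance-reweighted least-squares evaluation concrete, the following is a minimal sketch of an off-policy LSTD-Q(λ) update with linear features, in the spirit of the abstract. The function names, arguments (`phi`, `target_policy`, `behavior_policy`, `reg`), and the exact placement of the importance weight on the eligibility trace are illustrative assumptions, not the paper's OFP-LSTD-Q/OFP-LSPE-Q definitions, which may differ in detail.

```python
import numpy as np

def off_policy_lstd_q_lambda(transitions, phi, target_policy, behavior_policy,
                             n_features, gamma=0.99, lam=0.8, reg=1e-6):
    """Sketch of off-policy LSTD-Q(lambda) with importance-reweighted traces.

    transitions: iterable of (s, a, r, s_next, a_next) collected under the
        behavior policy; a_next is assumed to follow the target policy.
    phi(s, a): feature vector of a state-action pair (length n_features).
    target_policy(a, s), behavior_policy(a, s): action probabilities.
    """
    A = np.zeros((n_features, n_features))
    b = np.zeros(n_features)
    z = np.zeros(n_features)  # eligibility trace

    for (s, a, r, s_next, a_next) in transitions:
        rho = target_policy(a, s) / behavior_policy(a, s)  # importance weight
        z = rho * (gamma * lam * z + phi(s, a))            # reweighted trace
        A += np.outer(z, phi(s, a) - gamma * phi(s_next, a_next))
        b += z * r

    # Least-squares solution; a small ridge term keeps A invertible
    # when the sample size is small.
    theta = np.linalg.solve(A + reg * np.eye(n_features), b)
    return theta
```

Under this sketch, the estimate of the action-value weights `theta` is recomputed from the accumulated statistics `A` and `b`, so adding more behavior-policy samples only refines the same least-squares system.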

Full text