Abstract

For on-policy Actor-Critic (AC) reinforcement learning, sampling is time-consuming and expensive. To reuse previously collected samples efficiently and to reduce the large estimation variance, an off-policy AC learning algorithm based on an adaptive importance sampling (AIS) technique is proposed. The Critic estimates the value function using least-squares temporal difference learning with eligibility traces combined with the AIS technique. To control the trade-off between the bias and the variance of the policy-gradient estimate, a flattening factor is introduced into the importance weight used by the AIS; its value is determined automatically from samples and policies by an importance-weight cross-validation method. Based on the policy gradient estimated by the Critic, the Actor updates the policy parameters so as to obtain an optimal control policy. Simulation results on a queueing problem show that AC learning based on AIS achieves good, stable learning performance as well as fast convergence.
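
As a rough illustrative sketch (not part of the original abstract; the function name and the example probabilities are assumptions), the flattening factor can be thought of as an exponent applied to the ordinary importance weight, interpolating between the ordinary estimator (factor 0: low variance, possibly biased) and full importance sampling (factor 1: unbiased, possibly high variance):

def flattened_importance_weight(pi_target, pi_behavior, nu):
    # Ordinary importance weight: target-policy probability over
    # behavior-policy probability for the sampled action.
    w = pi_target / pi_behavior
    # Flattened (adaptive) weight: nu in [0, 1] trades bias against variance.
    return w ** nu

# Hypothetical example: the target policy assigns probability 0.6 to the
# sampled action, the behavior policy assigned 0.2, and nu = 0.5.
print(flattened_importance_weight(0.6, 0.2, nu=0.5))  # about 1.732

In the proposed method, the flattening factor would not be fixed by hand but selected automatically by importance-weight cross-validation, as stated above.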