Abstract

For on-policy Actor-Critic (AC) reinforcement learning, sampling is time-consuming and expensive. To reuse previously collected samples efficiently and to reduce the large estimation variance, an off-policy AC learning algorithm based on an adaptive importance sampling (AIS) technique is proposed. The Critic estimates the value function using least-squares temporal difference learning with eligibility traces combined with the AIS technique. To control the trade-off between the bias and the variance of the policy-gradient estimate, a flattening factor is introduced into the importance weight used by the AIS; its value is determined automatically from samples and policies by an importance-weight cross-validation method. Based on the policy gradient estimated by the Critic, the Actor updates the policy parameters so as to obtain an optimal control policy. Simulation results on a queueing problem show that AC learning based on AIS achieves good, stable learning performance as well as fast convergence.
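
As a rough illustrative sketch (not part of the original abstract; the function name and the example probabilities are assumptions), the flattening factor can be thought of as an exponent applied to the ordinary importance weight, interpolating between the ordinary estimator (factor 0: low variance, possibly biased) and full importance sampling (factor 1: unbiased, possibly high variance):

def flattened_importance_weight(pi_target, pi_behavior, nu):
    # Ordinary importance weight: target-policy probability over
    # behavior-policy probability for the sampled action.
    w = pi_target / pi_behavior
    # Flattened (adaptive) weight: nu in [0, 1] trades bias against variance.
    return w ** nu

# Hypothetical example: the target policy assigns probability 0.6 to the
# sampled action, the behavior policy assigned 0.2, and nu = 0.5.
print(flattened_importance_weight(0.6, 0.2, nu=0.5))  # about 1.732

In the proposed method, the flattening factor would not be fixed by hand but selected automatically by importance-weight cross-validation, as stated above.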