Approximate gradient methods in policy-space optimization of Markov reward processes

Marbach P*; Tsitsiklis JN

doi:10.1023/A:1022145020786

Abstract

We consider a discrete-time, finite-state Markov reward process that depends on a set of parameters. We start with a brief review of (stochastic) gradient descent methods that tune the parameters in order to optimize the average reward, using a single (possibly simulated) sample path of the process of interest. The resulting algorithms can be implemented online, and have the property that the gradient of the average reward converges to zero with probability 1. On the other hand, the updates can have a high variance, resulting in slow convergence. We address this issue and propose two approaches to reduce the variance. These approaches rely on approximate gradient formulas, which introduce an additional bias into the update direction. We derive bounds for the resulting bias terms and characterize the asymptotic behavior of the resulting algorithms. For one of the approaches considered, the magnitude of the bias term exhibits an interesting dependence on the time it takes for the rewards to reach steady state. We also apply the methodology to Markov reward processes with a reward-free termination state and an expected total reward criterion. We use a call admission control problem to illustrate the performance of the proposed algorithms.

Publication date: 2003-04

Last updated: 2017-06-27 11:10
