Abstract

In previous works, we showed that policy iteration algorithms for performance optimization follow directly from performance difference formulas. In this paper, we show that, based on this idea, policy-iteration-type optimization algorithms can be developed for "policies" that depend on system parameters. We illustrate this idea with a load-dependent closed Jackson network, where the policies differ from those of standard Markov decision processes (MDPs). First, we establish the performance difference formula. Then we show that a service-rate-based policy iteration algorithm can be developed using the aggregation of perturbation realization factors. The algorithm can be used to optimize the customer-average performance, an important metric that complements the traditional time-average performance. A sample-path-based learning algorithm is also developed; it does not require explicit knowledge of system parameters, such as the routing probabilities of the queueing network. Finally, a numerical example illustrates the efficiency of our algorithms. This approach can save computation because the space of parameter-based policies is smaller than that of state-based policies in standard MDPs.