Abstract

In this note, we show that the evaluation phase in the policy iteration algorithm for the infinite horizon discounted Markov decision problem can be done in O(mN²) operations, where N is the number of states of the Markov decision process and m is the number of states in which the decision changes during the policy improvement phase.
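One standard way to obtain a bound of this form is to maintain the inverse B = (I − βP_π)⁻¹ and update it with a rank-m Sherman-Morrison-Woodbury correction when only m rows of the transition matrix change. The sketch below illustrates that idea only; it is an assumption for exposition, not necessarily the construction used in this note, and all function and variable names (e.g. `woodbury_policy_update`) are hypothetical.

```python
import numpy as np

def woodbury_policy_update(B, P_new_rows, P_old_rows, changed, beta, r_new):
    """Illustrative rank-m update of B = (I - beta * P_pi)^(-1) when the
    policy changes only in the states listed in `changed` (|changed| = m).

    B           : (N, N) inverse maintained under the old policy
    P_new_rows  : (m, N) transition rows of the changed states, new policy
    P_old_rows  : (m, N) transition rows of the changed states, old policy
    changed     : length-m list of indices of the changed states
    beta        : discount factor in (0, 1)
    r_new       : (N,) one-step rewards under the new policy

    Returns (B_new, v_new) using O(m * N^2) arithmetic operations.
    """
    m = len(changed)

    # (I - beta*P_new) = (I - beta*P_old) + U @ W, where U stacks the unit
    # vectors of the changed states and W holds the scaled row differences.
    W = -beta * (P_new_rows - P_old_rows)        # (m, N)

    BU = B[:, changed]                           # = B @ U, column selection: O(m*N)
    WB = W @ B                                   # O(m*N^2)
    S = np.eye(m) + WB[:, changed]               # = I_m + W @ B @ U, (m, m)

    # Sherman-Morrison-Woodbury: B_new = B - (B U) S^{-1} (W B)
    B_new = B - BU @ np.linalg.solve(S, WB)      # O(m^3 + m*N^2)

    # Evaluation of the new policy: v_new = (I - beta*P_new)^{-1} r_new
    v_new = B_new @ r_new                        # O(N^2)
    return B_new, v_new
```

Under this accounting, the dominant costs are the products W @ B and (B U) S⁻¹ (W B), each O(mN²) when m ≤ N, which is consistent with the stated complexity.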