Lower Bound on Howard Policy Iteration for Deterministic Markov Decision Processes

📅 2025-06-13
📈 Citations: 0
Influential: 0
🤖 AI Summary
This paper establishes a tight lower bound on the iteration complexity of Howard's policy iteration algorithm for solving the average-reward problem in deterministic Markov decision processes (DMDPs). While prior work only achieved a sublinear lower bound of $\tilde{\Omega}(\sqrt{I})$, where $I$ denotes the input size (significantly underestimating empirical performance), this work is the first to prove a linear lower bound of $\tilde{\Omega}(I)$. Methodologically, the authors construct worst-case instances via graph-theoretic modeling: specifically, they design tailored policy graphs whose cycle-weight structures, combinatorial game-theoretic properties, and fixed-point iteration dynamics jointly enforce only constant progress per iteration. This construction rigorously captures the algorithm's slowest possible convergence behavior. The result closes a fundamental gap in the theoretical understanding of policy iteration, providing the most precise characterization to date of convergence efficiency for average-reward optimization algorithms.
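For reference, the mean-payoff (limit-average) value that the controller maximizes is standardly defined as the long-run average edge weight along a run (this is the standard textbook definition, not quoted from the paper):

```latex
% Mean payoff of an infinite run \rho = e_0 e_1 e_2 \dots in the DMDP graph,
% and the value of a state v over all (positional) policies \sigma
\mathrm{MP}(\rho) \;=\; \liminf_{n \to \infty} \frac{1}{n} \sum_{i=0}^{n-1} w(e_i),
\qquad
\mathrm{val}(v) \;=\; \max_{\sigma} \, \mathrm{MP}\!\left(\rho_\sigma^v\right)
```

In a DMDP, positional policies suffice, and the optimal value of a state equals the maximum mean weight of a cycle reachable from it; this is why the paper's worst-case construction centers on cycle-weight structures in policy graphs.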

📝 Abstract
Deterministic Markov Decision Processes (DMDPs) are a mathematical framework for decision-making where the outcomes and future possible actions are deterministically determined by the current action taken. DMDPs can be viewed as a finite directed weighted graph, where in each step, the controller chooses an outgoing edge. An objective is a measurable function on runs (or infinite trajectories) of the DMDP, and the value for an objective is the maximal cumulative reward (or weight) that the controller can guarantee. We consider the classical mean-payoff (aka limit-average) objective, which is a basic and fundamental objective. Howard's policy iteration algorithm is a popular method for solving DMDPs with mean-payoff objectives. Although Howard's algorithm performs well in practice, as experimental studies suggested, the best known upper bound is exponential and the current known lower bound is as follows: For the input size $I$, the algorithm requires $\tilde{\Omega}(\sqrt{I})$ iterations, where $\tilde{\Omega}$ hides the poly-logarithmic factors, i.e., the current lower bound on iterations is sub-linear with respect to the input size. Our main result is an improved lower bound for this fundamental algorithm where we show that for the input size $I$, the algorithm requires $\tilde{\Omega}(I)$ iterations.
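The abstract's graph view (each step the controller picks an outgoing edge, and a policy fixes one edge per node) lends itself to a compact sketch of Howard's policy iteration with the standard gain/bias evaluation. This is a minimal illustrative implementation under assumed conventions: the graph representation `{node: [(successor, weight), ...]}`, the function names, and the lexicographic (gain, bias) improvement rule are choices of this sketch, not details taken from the paper.

```python
def evaluate(graph, policy):
    """Gain g(v) = mean weight of the cycle reached under the policy,
    and bias h(v), for a deterministic positional policy."""
    gain, bias = {}, {}
    for start in graph:
        if start in gain:
            continue
        path, index, v = [], {}, start
        while v not in index and v not in gain:   # follow the policy
            index[v] = len(path)
            path.append(v)
            v = policy[v][0]
        if v in index:                            # discovered a new cycle
            cyc = path[index[v]:]
            g = sum(policy[u][1] for u in cyc) / len(cyc)
            gain[cyc[0]], bias[cyc[0]] = g, 0.0   # normalise h = 0 on the cycle
            for u in reversed(cyc[1:]):           # h(u) = w - g + h(policy(u))
                nxt, w = policy[u]
                gain[u], bias[u] = g, w - g + bias[nxt]
            transient = path[:index[v]]
        else:                                     # ran into an evaluated region
            transient = path
        for u in reversed(transient):             # transient nodes inherit gain
            nxt, w = policy[u]
            gain[u] = gain[nxt]
            bias[u] = w - gain[nxt] + bias[nxt]
    return gain, bias

def howard(graph, eps=1e-9):
    """Howard's policy iteration for maximising mean payoff in a DMDP."""
    policy = {v: edges[0] for v, edges in graph.items()}
    while True:
        gain, bias = evaluate(graph, policy)
        improved = False
        for v, edges in graph.items():
            def key(edge):                        # lexicographic (gain, bias)
                u, w = edge
                return (gain[u], w - gain[u] + bias[u])
            best, cur = max(edges, key=key), key(policy[v])
            cand = key(best)
            # eps guards against oscillation on floating-point ties
            if cand[0] > cur[0] + eps or (
                abs(cand[0] - cur[0]) <= eps and cand[1] > cur[1] + eps
            ):
                policy[v], improved = best, True
        if not improved:
            return gain, policy
```

On a toy instance with a node `a` choosing between a self-looping node of weight 1 and one of weight 2, the algorithm switches `a` toward the better cycle and converges with gain 2, illustrating the per-iteration improvement step whose worst-case slowness the paper's construction targets.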
Problem

Research questions and friction points this paper is trying to address.

Improving the known lower bound on the number of iterations of Howard's policy iteration in DMDPs
Analyzing the mean-payoff (limit-average) objective in deterministic Markov decision processes
Establishing a tighter iteration-complexity characterization for this classical algorithm
Innovation

Methods, ideas, or system contributions that make the work stand out.

A linear $\tilde{\Omega}(I)$ lower bound for Howard's policy iteration, improving the prior sublinear $\tilde{\Omega}(\sqrt{I})$ bound
Graph-theoretic construction of worst-case DMDP instances via tailored policy graphs
Cycle-weight structures that force the algorithm to make only constant progress per iteration