Non-Rectangular Average-Reward Robust MDPs: Optimal Policies and Their Transient Values

📅 2026-03-01
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the limitations of robust Markov decision processes (MDPs) under non-rectangular uncertainty sets, where the lack of a dynamic programming principle and the exclusive pursuit of average optimality can lead to arbitrarily poor transient performance. Breaking away from the conventional rectangularity assumption, the paper establishes a minimax representation of the robust value without relying on any dynamic programming structure and proves the existence of robust-optimal policies. It introduces the notion of “transient value” to expose the disconnect between average optimality and finite-horizon performance. Building on this insight, the authors design a piecewise policy that simultaneously guarantees robust average optimality and achieves constant-order transient regret. By integrating online learning, sequential hypothesis testing, and robust control, they convert high-probability regret bounds into expected-regret bounds, thereby constructing, for the first time, a computationally feasible policy that attains both uniformly sublinear regret and strong transient performance.
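
In symbols, the two results highlighted above read as follows. The notation here is ours, inferred from the abstract below, and may differ from the paper's:

```latex
% Notation (ours, not necessarily the paper's):
%   \mathcal{P}   ambiguity set of stationary transition kernels
%   g^{\pi}(P)    long-run average reward of policy \pi under kernel P
%   g^{*}(P)      classical optimal gain of the (non-robust) MDP with kernel P

% Minimax representation of the robust value, with no rectangularity assumed:
g^{\mathrm{rob}}
  \;=\; \sup_{\pi}\, \inf_{P \in \mathcal{P}} g^{\pi}(P)
  \;=\; \inf_{P \in \mathcal{P}} g^{*}(P)

% Sufficient condition for robust optimality of a history-dependent policy \pi:
% expected regret sublinear uniformly over the ambiguity set,
\sup_{P \in \mathcal{P}}
  \mathbb{E}^{\pi}_{P}\!\Big[\, T\, g^{*}(P) - \sum_{t=1}^{T} r(s_t, a_t) \Big]
  \;=\; o(T).
```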

📝 Abstract
We study non-rectangular robust Markov decision processes under the average-reward criterion, where the ambiguity set couples transition probabilities across states and the adversary commits to a stationary kernel for the entire horizon. We show that any history-dependent policy achieving sublinear expected regret uniformly over the ambiguity set is robust-optimal, and that the robust value admits a minimax representation as the infimum over the ambiguity set of the classical optimal gains, without requiring any form of rectangularity or robust dynamic programming principle. Under the weakly communicating assumption, we establish the existence of such policies by converting high-probability regret bounds from the average-reward reinforcement learning literature into the expected-regret criterion. We then introduce a transient-value framework to evaluate the finite-time performance of robust-optimal policies, proving that average-reward optimality alone can mask arbitrarily poor transients and deriving regret-based lower bounds on transient values. Finally, we construct an epoch-based policy that combines an optimal stationary policy for the worst-case model with an anytime-valid sequential test and an online learning fallback, achieving a constant-order transient value.
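
The epoch-based construction in the final sentence can be sketched as control flow. The toy below is ours, not the paper's: a two-state, two-action MDP, a likelihood-ratio e-process standing in for the anytime-valid sequential test, and an epsilon-greedy table update standing in for the no-regret online learning fallback. All names and constants are illustrative.

```python
import math
import random

# Toy instance: two states, two actions. Everything below is illustrative;
# the paper specifies the construction mathematically, not as an API.
STATES, ACTIONS = (0, 1), (0, 1)

def reward(s, a):
    return 1.0 if (s, a) == (1, 1) else 0.1

# Assumed worst-case kernel: P_WORST[s][a] = probability of moving to state 1.
P_WORST = {0: {0: 0.3, 1: 0.7}, 1: {0: 0.4, 1: 0.8}}

def pi_star(s):
    """Stand-in for the optimal stationary policy of the worst-case model."""
    return 1

class LikelihoodRatioTest:
    """Anytime-valid sequential test via a likelihood-ratio e-process:
    reject the worst-case model once the ratio against a fixed alternative
    exceeds 1/alpha (Ville's inequality keeps the level valid at any
    data-dependent stopping time)."""
    def __init__(self, alpha=0.05, shift=0.15):
        self.log_ratio, self.alpha, self.shift = 0.0, alpha, shift

    def update(self, s, a, went_to_1):
        p0 = P_WORST[s][a]
        p1 = min(max(p0 + self.shift, 1e-6), 1.0 - 1e-6)  # fixed alternative
        self.log_ratio += math.log((p1 if went_to_1 else 1 - p1) /
                                   (p0 if went_to_1 else 1 - p0))
        return self.log_ratio > math.log(1.0 / self.alpha)

def epoch_based_policy(true_kernel, horizon, rng):
    """Phase 1: play pi_star while the test accepts the worst-case model.
    Phase 2: on rejection, fall back to an online learner (epsilon-greedy
    stub here; the paper invokes a no-regret average-reward algorithm)."""
    test, rejected, s, total = LikelihoodRatioTest(), False, 0, 0.0
    q = {(x, a): 0.0 for x in STATES for a in ACTIONS}  # fallback value table
    for _ in range(horizon):
        if not rejected:
            a = pi_star(s)
        elif rng.random() < 0.1:
            a = rng.choice(ACTIONS)                      # explore
        else:
            a = max(ACTIONS, key=lambda x: q[(s, x)])    # exploit
        went_to_1 = rng.random() < true_kernel[s][a]
        r = reward(s, a)
        total += r
        if not rejected:
            rejected = test.update(s, a, went_to_1)      # may trigger fallback
        else:
            q[(s, a)] += 0.1 * (r - q[(s, a)])           # crude bandit-style update
        s = 1 if went_to_1 else 0
    return total / horizon

rng = random.Random(0)
print("avg reward when the worst-case model is true:",
      epoch_based_policy(P_WORST, 50_000, rng))
```

The structural point matches the abstract: while the sequential test accepts the worst-case model, the agent plays the precomputed stationary policy and pays no exploration cost, and the anytime validity of the test keeps the guarantee sound at the data-dependent switching time.
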
Problem

Research questions and friction points this paper is trying to address.

Robust MDPs
Average-reward criterion
Non-rectangular ambiguity
Transient performance
Optimal policies
Innovation

Methods, ideas, or system contributions that make the work stand out.

Non-rectangular robust MDPs
Average-reward criterion
Transient-value framework
Minimax representation
Epoch-based policy