Learning Weakly Communicating Average-Reward CMDPs: Strong Duality and Improved Regret

📅 2026-05-12

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses two gaps for infinite-horizon average-reward constrained Markov decision processes (CMDPs) under the weakly communicating assumption: the lack of an established strong duality result and the suboptimal regret bounds of existing algorithms. The authors establish strong duality over stationary policies with finite state and action spaces by exploiting the geometric structure of the occupancy measure set, and they propose a primal-dual clipped value iteration algorithm for linear CMDPs. The algorithm extends clipped value iteration to the constrained setting and applies it to a finite-horizon approximation of the infinite-horizon problem, which stabilizes the dual variable. A strong-duality-based analysis decomposes the composite Lagrangian regret into separate bounds on regret and constraint violation, yielding $\widetilde{\mathcal{O}}(T^{2/3})$ bounds for both over $T$ interactions and improving upon the best known bounds.
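To make the two ingredients named above more concrete, here is a minimal Python sketch of span-clipped finite-horizon value iteration and a projected dual update. It is an illustration under strong simplifying assumptions, not the paper's algorithm: the model is tabular and known (the paper treats linear CMDPs with learning), there is no optimism bonus, and the names `clipped_value_iteration`, `dual_step`, `span_bound`, `threshold`, and `lam_max` are hypothetical.

```python
import numpy as np


def clipped_value_iteration(P, reward, horizon, span_bound):
    """Finite-horizon value iteration with span clipping (illustrative only).

    P: array of shape (S, A, S) with transition probabilities (assumed known).
    reward: array of shape (S, A); in a primal-dual scheme this would be the
        Lagrangian reward r + lam * c for the current dual variable lam.
    After every backup, V is clipped into [min V, min V + span_bound].
    """
    num_states = P.shape[0]
    V = np.zeros(num_states)
    policy = np.zeros((horizon, num_states), dtype=int)
    for h in reversed(range(horizon)):
        Q = reward + P @ V                        # Bellman backup, shape (S, A)
        policy[h] = Q.argmax(axis=1)              # greedy action at step h
        V = Q.max(axis=1)
        V = np.minimum(V, V.min() + span_bound)   # clip the span of V
    return policy, V


def dual_step(lam, eta, threshold, observed_utility, lam_max):
    """Projected dual update for a constraint of the form utility >= threshold.

    The multiplier grows when the observed utility falls short of the
    threshold and is projected back onto [0, lam_max].
    """
    return float(np.clip(lam + eta * (threshold - observed_utility), 0.0, lam_max))


# Toy example: two absorbing states, one action; state 0 pays reward 1.
P = np.zeros((2, 1, 2))
P[0, 0, 0] = 1.0
P[1, 0, 1] = 1.0
r = np.array([[1.0], [0.0]])
policy, V = clipped_value_iteration(P, r, horizon=5, span_bound=2.0)
lam = dual_step(lam=0.0, eta=0.5, threshold=1.0, observed_utility=0.0, lam_max=10.0)
```

In the toy example the unclipped value of state 0 would grow with the horizon, while clipping keeps the span of `V` at most `span_bound`. That bounded span is the property the clipping step is meant to enforce.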

📝 Abstract

We study infinite-horizon average-reward constrained Markov decision processes (CMDPs) under the weakly communicating assumption. Our contributions are twofold. First, we establish strong duality for weakly communicating average-reward CMDPs over stationary policies with finite state and action spaces. Despite the absence of a linear programming formulation and the resulting nonconvexity under the weakly communicating setting, we show that strong duality still holds by carefully exploiting the geometric structure of the occupation measure set. Second, building on this result, we propose a primal-dual clipped value iteration algorithm for learning weakly communicating average-reward linear CMDPs. Our algorithm achieves regret and constraint violation bounds of $\widetilde{\mathcal{O}}(T^{2/3})$, improving upon the best known bounds, where $T$ denotes the number of interactions. Our approach extends clipped value iteration to the constrained setting and adapts it to a finite-horizon approximation, which stabilizes the dual variable and is crucial for achieving improved regret bounds. To analyze this, we develop a novel approach based on strong duality that enables the decomposition of the composite Lagrangian regret into separate bounds on regret and constraint violation.
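For orientation, the objects in the abstract can be written in one common notation. The symbols here are assumptions and do not come from the listing: $J_r(\pi)$ and $J_c(\pi)$ denote the long-run average reward and average utility of a stationary policy $\pi \in \Pi_S$ (suppressing any dependence on the initial state), $b$ is the constraint threshold, and the constraint is taken to be $J_c(\pi) \ge b$. The paper's own sign conventions and definitions may differ.

$$
L(\pi, \lambda) = J_r(\pi) + \lambda \bigl( J_c(\pi) - b \bigr), \qquad
\max_{\pi \in \Pi_S} \, \min_{\lambda \ge 0} L(\pi, \lambda) \;=\; \min_{\lambda \ge 0} \, \max_{\pi \in \Pi_S} L(\pi, \lambda).
$$

With $J^\star$ the optimal constrained average reward and $(s_t, a_t)$ the state-action pair at interaction $t$, one standard way to define the two quantities being bounded is

$$
\mathrm{Regret}(T) = \sum_{t=1}^{T} \bigl( J^\star - r(s_t, a_t) \bigr), \qquad
\mathrm{Violation}(T) = \sum_{t=1}^{T} \bigl( b - c(s_t, a_t) \bigr),
$$

and the abstract's claim is that both are $\widetilde{\mathcal{O}}(T^{2/3})$.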

Problem

Research questions and friction points this paper is trying to address.

constrained Markov decision processes

average-reward

weakly communicating

strong duality

regret

Innovation

Methods, ideas, or system contributions that make the work stand out.

strong duality

weakly communicating CMDPs

clipped value iteration