The regret lower bound for communicating Markov Decision Processes

📅 2025-01-22
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This paper establishes a problem-dependent regret lower bound for communicating Markov decision processes (MDPs), extending the well-known Ω(log T) lower bound for ergodic MDPs and providing a fundamental performance benchmark for this broader class. To capture the additional structural complexity of communicating MDPs, the authors formalize a “co-exploration” phenomenon: consistent learning agents must over-visit all optimal regions of the environment relative to sub-optimal ones. Their analysis shows that these explorative and co-explorative behaviors are intertwined with navigation constraints obtained by scrutinizing the MDP's navigation structure at logarithmic scale. The resulting lower bound is expressed as the value of an optimization problem that specializes to recover existing results for many standard classes of MDPs. Computationally, they prove that evaluating this lower bound is Σ₂^P-hard in general, and that even testing membership in its feasible region is coNP-hard. They further propose a constructive approximation algorithm for the lower bound.
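For context, the ergodic-MDP lower bound that this work generalizes is classically attributed to Burnetas and Katehakis; a sketch of its usual form (notation ours, not taken from this paper) is:

$$ \liminf_{T\to\infty} \frac{\mathbb{E}[R(T)]}{\log T} \;\ge\; \sum_{(s,a)\,:\,\Delta(s,a)>0} \frac{\Delta(s,a)}{K_{\inf}(s,a)}, $$

where $\Delta(s,a)$ denotes the gain gap incurred by playing sub-optimal action $a$ in state $s$, and $K_{\inf}(s,a)$ is the minimal Kullback–Leibler divergence needed to turn the true model into a confusing alternative in which $(s,a)$ becomes optimal. The paper's contribution is that, for communicating MDPs, the corresponding optimization problem acquires additional co-exploration and navigation constraints.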

📝 Abstract
This paper is devoted to the extension of the regret lower bound beyond ergodic Markov decision processes (MDPs) in the problem-dependent setting. While the regret lower bound for ergodic MDPs is well-known and attained by tractable algorithms, we prove that the regret lower bound becomes significantly more complex in communicating MDPs. Our lower bound revisits the necessary explorative behavior of consistent learning agents and further explains that all optimal regions of the environment must be over-visited compared to sub-optimal ones, a phenomenon that we refer to as co-exploration. In tandem, we show that these two explorative and co-explorative behaviors are intertwined with navigation constraints obtained by scrutinizing the navigation structure at logarithmic scale. The resulting lower bound is expressed as the solution of an optimization problem that, in many standard classes of MDPs, can be specialized to recover existing results. From a computational perspective, it is provably $\Sigma_2^{\mathrm{P}}$-hard in general and, as a matter of fact, even testing membership in the feasible region is coNP-hard. We further provide an algorithm to approximate the lower bound in a constructive way.
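To make the notion of regret concrete, here is a minimal sketch (ours, not the paper's) of cumulative regret $R(T) = T g^* - \sum_{t=1}^T r_t$ in a hypothetical two-state communicating MDP, where $g^*$ is the optimal average gain. A uniformly random policy incurs linear regret; the paper concerns the best achievable logarithmic rate for consistent learners.

```python
import random

# Toy communicating MDP (hypothetical, for illustration only):
# states {0, 1}; action 0 = "stay", action 1 = "switch".
# Staying in state 1 yields reward 1; everything else yields 0.
# All transitions are deterministic and every state is reachable
# from every other, so the MDP is communicating (but not ergodic
# under every policy).
def step(state, action):
    if action == 1:                 # switch states, no reward
        return 1 - state, 0.0
    return state, (1.0 if state == 1 else 0.0)  # stay

G_STAR = 1.0  # optimal average gain: reach state 1, then stay forever

def regret_of_uniform_policy(T, seed=0):
    """Cumulative regret R(T) = T * g* - total collected reward."""
    rng = random.Random(seed)
    state, total = 0, 0.0
    for _ in range(T):
        state, r = step(state, rng.randrange(2))
        total += r
    return T * G_STAR - total
```

Under the uniform policy the chain spends about half its time in state 1 and stays there with probability 1/2, so the regret grows linearly at rate roughly 3/4 per step, far above the Θ(log T) rate characterized by the paper's lower bound.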
Problem

Research questions and friction points this paper is trying to address.

Problem-Dependent Lower Bounds
Communicating Markov Decision Processes
Regret Minimization
Innovation

Methods, ideas, or system contributions that make the work stand out.

Exploration Complexity
Co-Exploration Phenomenon
Constructive Approximation Algorithm