Near-Optimal Regret for Policy Optimization in Contextual MDPs with General Offline Function Approximation

📅 2026-02-14
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses policy optimization in stochastic contextual Markov decision processes (CMDPs) with general offline function approximation. It proposes the OPO-CMDP algorithm, which achieves, for the first time in CMDPs, optimal dependence on the cardinalities of the state space $|S|$ and action space $|A|$. The method integrates optimistic policy optimization with finite function classes that model both the transition dynamics and the rewards. Through high-probability analysis, it attains a near-optimal regret bound of $\widetilde{O}(H^4 \sqrt{T|S||A| \log(|\mathcal{F}||\mathcal{P}|)})$, substantially improving upon existing results and establishing the theoretical and computational advantages of optimistic policy optimization in this setting.

📝 Abstract
We introduce \texttt{OPO-CMDP}, the first policy optimization algorithm for stochastic Contextual Markov Decision Processes (CMDPs) under general offline function approximation. Our approach achieves a high-probability regret bound of $\widetilde{O}(H^4\sqrt{T|S||A|\log(|\mathcal{F}||\mathcal{P}|)})$, where $S$ and $A$ denote the state and action spaces, $H$ the horizon length, $T$ the number of episodes, and $\mathcal{F}, \mathcal{P}$ the finite function classes used to approximate the losses and dynamics, respectively. This is the first regret bound with optimal dependence on $|S|$ and $|A|$, directly improving the current state of the art (Qian, Hu, and Simchi-Levi, 2024). These results demonstrate that optimistic policy optimization provides a natural, computationally superior, and theoretically near-optimal path for solving CMDPs.
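To make the abstract's central idea concrete, here is a minimal sketch of a generic optimistic policy-optimization update: an exponential-weights (mirror-descent-style) step on action values inflated by an exploration bonus. This is not the paper's OPO-CMDP algorithm; the tabular setting, the $\beta/\sqrt{n}$ bonus form, and the step size are illustrative assumptions.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D array.
    z = x - x.max()
    e = np.exp(z)
    return e / e.sum()

def optimistic_po_step(policy, q_hat, counts, eta=0.1, beta=1.0):
    """One exponential-weights policy update with an optimistic bonus.

    policy: (S, A) current stochastic policy (rows sum to 1)
    q_hat:  (S, A) estimated action values
    counts: (S, A) visit counts; the bonus beta / sqrt(n) shrinks
            for well-explored state-action pairs
    """
    bonus = beta / np.sqrt(np.maximum(counts, 1.0))
    q_opt = q_hat + bonus  # optimism: inflate uncertain actions
    logits = np.log(policy + 1e-12) + eta * q_opt
    return np.apply_along_axis(softmax, 1, logits)
```

Under this kind of update, actions with few visits receive a larger bonus and hence more probability mass, which is the mechanism that drives exploration in optimistic policy optimization.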
Problem

Research questions and friction points this paper is trying to address.

Contextual MDPs
policy optimization
regret bound
offline function approximation
stochastic CMDPs
Innovation

Methods, ideas, or system contributions that make the work stand out.

Policy Optimization
Contextual MDPs
Offline Function Approximation
Regret Bound
Optimistic Learning