🤖 AI Summary
This work addresses policy optimization in stochastic contextual Markov decision processes (CMDPs) with general offline function approximation. It proposes the OPO-CMDP algorithm, which achieves, for the first time in CMDPs, optimal dependence on the cardinalities of the state space $|S|$ and action space $|A|$. The method integrates optimistic policy optimization with finite function classes that model both the transition dynamics and the rewards. Through high-probability analysis techniques, it attains a near-optimal regret bound of $\widetilde{O}(H^4 \sqrt{T|S||A| \log(|\mathcal{F}||\mathcal{P}|)})$, substantially improving upon existing results and establishing the theoretical and computational advantages of optimistic policy optimization in this setting.
📝 Abstract
We introduce \texttt{OPO-CMDP}, the first policy optimization algorithm for stochastic Contextual Markov Decision Processes (CMDPs) under general offline function approximation. Our approach achieves a high-probability regret bound of $\widetilde{O}(H^4\sqrt{T|S||A|\log(|\mathcal{F}||\mathcal{P}|)}),$ where $S$ and $A$ denote the state and action spaces, $H$ the horizon length, $T$ the number of episodes, and $\mathcal{F}, \mathcal{P}$ the finite function classes used to approximate the losses and dynamics, respectively. This is the first regret bound with optimal dependence on $|S|$ and $|A|$, directly improving on the current state of the art (Qian, Hu, and Simchi-Levi, 2024). These results demonstrate that optimistic policy optimization provides a natural, computationally superior, and theoretically near-optimal path for solving CMDPs.