🤖 AI Summary
This work addresses policy optimization in stochastic contextual Markov decision processes (CMDPs) with general offline function approximation. It proposes the OPO-CMDP algorithm, which achieves, for the first time in CMDPs, optimal dependence on the cardinalities of the state space $|S|$ and action space $|A|$. The method integrates optimistic policy optimization with finite function classes that model both the transition dynamics and the rewards. Through high-probability analysis techniques, it attains a near-optimal regret bound of $\widetilde{O}(H^4 \sqrt{T|S||A| \log(|\mathcal{F}||\mathcal{P}|)})$, substantially improving upon existing results and establishing the theoretical and computational advantages of optimistic policy optimization in this setting.
📝 Abstract
We introduce \texttt{OPO-CMDP}, the first policy optimization algorithm for stochastic Contextual Markov Decision Processes (CMDPs) under general offline function approximation. Our approach achieves a high-probability regret bound of $\widetilde{O}(H^4\sqrt{T|S||A|\log(|\mathcal{F}||\mathcal{P}|)}),$ where $S$ and $A$ denote the state and action spaces, $H$ the horizon length, $T$ the number of episodes, and $\mathcal{F}, \mathcal{P}$ the finite function classes used to approximate the losses and dynamics, respectively. This is the first regret bound with optimal dependence on $|S|$ and $|A|$, directly improving on the current state of the art (Qian, Hu, and Simchi-Levi, 2024). These results demonstrate that optimistic policy optimization provides a natural, computationally superior, and theoretically near-optimal path for solving CMDPs.