🤖 AI Summary
This work addresses the worst-case $\widetilde{O}(d H^2 \sqrt{T})$ regret bounds of existing reinforcement learning algorithms for episodic Markov decision processes (MDPs) with multinomial logistic (MNL) transitions. The authors propose a variance-aware algorithm built around a problem-dependent constant $\bar{\sigma}_T \leq 1/2$, which measures the normalized average variance of the optimal downstream value function along the learner's trajectory and is used to construct variance-dependent confidence sets and exploration policies. Their analysis establishes matching upper and lower bounds, showing that the regret $\widetilde{O}(d H^2 \bar{\sigma}_T \sqrt{T})$ is minimax optimal up to logarithmic factors. In structured settings such as KL-constrained robust MDPs, where $\bar{\sigma}_T = O(H^{-1})$, the dependence on the episode length $H$ improves by a factor of $H$. Together, these results fully characterize the regret complexity of MNL mixture MDPs.
📝 Abstract
We study reinforcement learning for episodic Markov Decision Processes (MDPs) whose transitions are modelled by a multinomial logistic (MNL) model. Existing algorithms for MNL mixture MDPs yield a regret of $\smash{\tilde{O}(dH^2\sqrt{T})}$ (Li et al., 2024), where $d$ is the feature dimension, $H$ the episode length, and $T$ the number of episodes. Inspired by the logistic bandit literature (Abeille et al., 2021; Faury et al., 2022; Boudart et al., 2026), we introduce a problem-dependent constant $\bar{\sigma}_T \leq 1/2$, measuring the normalised average variance of the optimal downstream value function along the learner's trajectory. We propose an algorithm achieving a regret of $\smash{\tilde{O}(dH^2\bar{\sigma}_T\sqrt{T})}$, which recovers the existing bound in the worst case and improves upon it for structured MDPs. For instance, for KL-constrained robust MDPs, $\bar{\sigma}_T = O(H^{-1})$, reducing the horizon dependence by a factor $H$. We further establish a matching $\smash{\Omega(dH^2\bar{\sigma}_T\sqrt{T})}$ lower bound, proving minimax optimality (up to logarithmic factors) and fully characterising the regret complexity of MNL mixture MDPs for the first time.
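As a quick check of the horizon claim, the stated rate $\bar{\sigma}_T = O(H^{-1})$ for KL-constrained robust MDPs can be substituted into the upper bound. This is only a worked substitution using the quantities named in the abstract, with constants and logarithmic factors suppressed:

$$
\tilde{O}\big(dH^2\,\bar{\sigma}_T\sqrt{T}\big)
\;=\; \tilde{O}\big(dH^2 \cdot H^{-1}\sqrt{T}\big)
\;=\; \tilde{O}\big(dH\sqrt{T}\big).
$$

At the other extreme, taking $\bar{\sigma}_T$ at its maximum value $1/2$ gives $\tilde{O}(dH^2\sqrt{T})$, which is the existing worst-case bound.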