Minimax Optimal Variance-Aware Regret Bounds for Multinomial Logistic MDPs

📅 2026-05-19

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the overly conservative regret bounds of existing reinforcement learning algorithms for episodic Markov decision processes (MDPs) with multinomial logistic (MNL) transitions. The authors introduce a problem-dependent constant $\bar{\sigma}_T \leq 1/2$, which measures the normalized average variance of the optimal downstream value function along the learner's trajectory, and propose a variance-aware algorithm that uses it to construct variance-dependent confidence sets and exploration policies. The resulting upper bound of $\widetilde{O}(d H^2 \bar{\sigma}_T \sqrt{T})$ is matched by an $\Omega(d H^2 \bar{\sigma}_T \sqrt{T})$ lower bound, so the algorithm is minimax optimal up to logarithmic factors. In structured settings such as KL-constrained robust MDPs, where $\bar{\sigma}_T = O(H^{-1})$, the dependence on the episode length $H$ is reduced by a factor of $H$. Together, these results fully characterize the regret complexity of MNL mixture MDPs.

📝 Abstract

We study reinforcement learning for episodic Markov Decision Processes (MDPs) whose transitions are modelled by a multinomial logistic (MNL) model. Existing algorithms for MNL mixture MDPs yield a regret of $\smash{\tilde{O}(dH^2\sqrt{T})}$ (Li et al., 2024), where $d$ is the feature dimension, $H$ the episode length, and $T$ the number of episodes. Inspired by the logistic bandit literature (Abeille et al., 2021; Faury et al., 2022; Boudart et al., 2026), we introduce a problem-dependent constant $\bar{\sigma}_T \leq 1/2$, measuring the normalised average variance of the optimal downstream value function along the learner's trajectory. We propose an algorithm achieving a regret of $\smash{\tilde{O}(dH^2\bar{\sigma}_T\sqrt{T})}$, which recovers the existing bound in the worst case and improves upon it for structured MDPs. For instance, for KL-constrained robust MDPs, $\bar{\sigma}_T = O(H^{-1})$, reducing the horizon dependence by a factor $H$. We further establish a matching $\smash{\Omega(dH^2\bar{\sigma}_T\sqrt{T})}$ lower bound, proving minimax optimality (up to logarithmic factors) and fully characterising the regret complexity of MNL mixture MDPs for the first time.
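The quantities in the abstract can be illustrated with a minimal Python sketch. It is not the paper's algorithm, and everything in it is an assumption made for illustration: the softmax form of the MNL transition model, values lying in $[0, H]$, the one-step standard deviation used as a stand-in for the paper's $\bar{\sigma}_T$ (whose exact definition is not given here), and the function names `mnl_transition_probs`, `normalised_value_std`, and `regret_rate`. Under these assumptions, the sketch shows why a normalised standard deviation cannot exceed $1/2$ and how $\bar{\sigma}_T = 1/H$ removes one factor of $H$ from the rate.

```python
import math


def mnl_transition_probs(features, theta):
    """Softmax (MNL) probabilities over candidate next states.

    Assumed parameterisation: logit of next state i is <features[i], theta>.
    """
    logits = [sum(f * t for f, t in zip(phi, theta)) for phi in features]
    shift = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(z - shift) for z in logits]
    total = sum(weights)
    return [w / total for w in weights]


def normalised_value_std(probs, next_values, horizon):
    """Standard deviation of V(s') / H under the given next-state distribution.

    With values in [0, H], the scaled values lie in [0, 1], so the result
    is at most 1/2 (Popoviciu's inequality on variances).
    """
    scaled = [v / horizon for v in next_values]
    mean = sum(p * x for p, x in zip(probs, scaled))
    var = sum(p * (x - mean) ** 2 for p, x in zip(probs, scaled))
    return math.sqrt(var)


def regret_rate(d, horizon, episodes, sigma_bar):
    """Leading term d * H^2 * sigma_bar * sqrt(T), ignoring constants and logs."""
    return d * horizon**2 * sigma_bar * math.sqrt(episodes)


H = 10
# Two next states with identical features give a uniform MNL distribution.
probs = mnl_transition_probs([[0.0, 0.0], [0.0, 0.0]], theta=[1.0, -1.0])
# Extreme case: the value is either 0 or H with equal probability.
worst_case_sigma = normalised_value_std(probs, [0.0, float(H)], H)

# Worst case (sigma_bar = 1/2) versus a structured case (sigma_bar = 1/H).
worst_rate = regret_rate(d=4, horizon=H, episodes=10_000, sigma_bar=0.5)
structured_rate = regret_rate(d=4, horizon=H, episodes=10_000, sigma_bar=1.0 / H)
```

In this toy setting the equal split between values $0$ and $H$ attains the largest possible normalised standard deviation, which corresponds to the worst case in which the variance-aware rate matches the existing $\tilde{O}(dH^2\sqrt{T})$ bound up to a constant.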

Problem

Research questions and friction points this paper is trying to address.

Multinomial Logistic MDPs

regret bounds

variance-aware

minimax optimality

reinforcement learning

Innovation

Methods, ideas, or system contributions that make the work stand out.

variance-aware regret

multinomial logistic MDP

minimax optimality