AI Summary
In Markov decision processes (MDPs), large state spaces induce prohibitive statistical and computational bottlenecks due to high data storage and processing overhead. Method: We propose a reinforcement learning framework based on multinomial logit (MNL) function approximation, integrating improved optimistic policy iteration, trust-region updates, and online gradient estimation. Contribution/Results: Our approach is the first to eliminate explicit dependence on the problem-specific parameter $\kappa^{-1}$ in the dominant term of the regret bound: we attain a regret upper bound of $\widetilde{\mathcal{O}}(dH^2\sqrt{K} + \kappa^{-1}d^2H^2)$, breaking the prior polynomial dependence of the leading term on $\kappa^{-1}$. Theoretically, we also establish the first lower bound for this setting, matching the upper bound in both the feature dimension $d$ and the number of episodes $K$, and achieve constant per-step computational complexity. This result preserves statistical optimality while substantially improving computational efficiency.
Abstract
We study a new class of MDPs that employs multinomial logit (MNL) function approximation to ensure valid probability distributions over the state space. Despite its significant benefits, incorporating the non-linear function raises substantial challenges in both statistical and computational efficiency. The best-known result of Hwang and Oh [2023] achieved an $\widetilde{\mathcal{O}}(\kappa^{-1}dH^2\sqrt{K})$ regret upper bound, where $\kappa$ is a problem-dependent quantity, $d$ is the feature dimension, $H$ is the episode length, and $K$ is the number of episodes. However, we observe that $\kappa^{-1}$ exhibits polynomial dependence on the number of reachable states, which can be as large as the state space size in the worst case and thus undermines the motivation for function approximation. Additionally, their method requires storing all historical data, and its time complexity scales linearly with the episode count, which is computationally expensive. In this work, we propose a statistically efficient algorithm that achieves a regret of $\widetilde{\mathcal{O}}(dH^2\sqrt{K} + \kappa^{-1}d^2H^2)$, eliminating the dependence on $\kappa^{-1}$ in the dominant term for the first time. We then address the computational challenges by introducing an enhanced algorithm that achieves the same regret guarantee but with only constant per-step computational cost. Finally, we establish the first lower bound for this problem, justifying the optimality of our results in $d$ and $K$.
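To make the modeling assumption concrete, here is a minimal sketch of how MNL function approximation parameterizes transition probabilities: a softmax over feature scores of the reachable next states, which is guaranteed to yield a valid probability distribution. The feature map `phi` and parameter `theta` below are illustrative assumptions, not the paper's exact construction.

```python
import numpy as np

def mnl_transition_probs(phi_next, theta):
    """MNL (softmax) transition model sketch.

    phi_next : (n_next, d) array, one feature vector per reachable
               next state s' for a fixed state-action pair (s, a).
    theta    : (d,) unknown parameter, estimated online in the paper.
    Returns a valid probability distribution over the next states.
    """
    logits = phi_next @ theta
    logits -= logits.max()          # subtract max for numerical stability
    weights = np.exp(logits)
    return weights / weights.sum()  # normalizes to sum to 1

# Toy example: d = 3 features, 4 reachable next states (hypothetical sizes).
rng = np.random.default_rng(0)
phi = rng.normal(size=(4, 3))
theta = rng.normal(size=3)
p = mnl_transition_probs(phi, theta)
print(p.sum())  # always 1 by construction
```

Because the softmax normalizes over only the reachable next states, the number of such states governs how flat the distribution can be, which is one intuition for why the problem-dependent quantity $\kappa^{-1}$ can grow with the reachable-state count.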