Provably Efficient Reinforcement Learning with Multinomial Logit Function Approximation

📅 2024-05-27
🏛️ arXiv.org
📈 Citations: 3
✨ Influential: 1
🤖 AI Summary
In Markov decision processes (MDPs), large state spaces induce prohibitive statistical and computational bottlenecks due to high data storage and processing overhead. Method: We propose a reinforcement learning framework based on multinomial logit (MNL) function approximation, integrating improved optimistic policy iteration, trust-region updates, and online gradient estimation. Contribution/Results: Our approach is the first to eliminate explicit dependence on the problem-specific parameter $\kappa^{-1}$ in the dominant term of the regret bound, attaining a regret upper bound of $\widetilde{\mathcal{O}}(dH^2\sqrt{K} + \kappa^{-1}d^2H^2)$ and breaking the prior polynomial dependence on $\kappa^{-1}$ in the leading term. Theoretically, we establish the first lower bound for this setting, matching the upper bound in both the feature dimension $d$ and the number of episodes $K$, and achieve constant per-step computational complexity. This result preserves statistical optimality while substantially improving computational efficiency.
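The constant per-step cost stems from replacing full-batch re-estimation over all stored history with an online gradient update that touches only the current transition. Below is a generic SGD-style sketch of such an online MNL gradient step (our own illustration under assumed names `phi`, `theta`, and a hypothetical learning rate, not the paper's exact estimator):

```python
import numpy as np

def nll(theta, phi, next_idx):
    """Negative log-likelihood of the observed next state under MNL."""
    logits = phi @ theta
    logits = logits - logits.max()          # numerical stability
    return np.log(np.exp(logits).sum()) - logits[next_idx]

def online_mnl_step(theta, phi, next_idx, lr=0.01):
    """One online gradient step on the MNL log-loss.

    Only the current transition (phi, next_idx) is used, so the
    per-step cost is independent of how many episodes have elapsed.
    """
    logits = phi @ theta
    logits = logits - logits.max()
    p = np.exp(logits)
    p = p / p.sum()                          # MNL transition probabilities
    grad = phi.T @ p - phi[next_idx]         # d-dimensional gradient
    return theta - lr * grad

# Toy example: 4 candidate next states, feature dimension d = 3.
rng = np.random.default_rng(0)
phi = rng.normal(size=(4, 3))
theta = np.zeros(3)
before = nll(theta, phi, next_idx=2)
theta = online_mnl_step(theta, phi, next_idx=2)
after = nll(theta, phi, next_idx=2)
```

Because the MNL log-loss is convex and smooth, a small step along the negative gradient of the current transition's loss decreases it, and no historical data needs to be retained.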

๐Ÿ“ Abstract
We study a new class of MDPs that employs multinomial logit (MNL) function approximation to ensure valid probability distributions over the state space. Despite its significant benefits, incorporating the non-linear function raises substantial challenges in both statistical and computational efficiency. The best-known result of Hwang and Oh [2023] has achieved an $\widetilde{\mathcal{O}}(\kappa^{-1}dH^2\sqrt{K})$ regret upper bound, where $\kappa$ is a problem-dependent quantity, $d$ is the feature dimension, $H$ is the episode length, and $K$ is the number of episodes. However, we observe that $\kappa^{-1}$ exhibits polynomial dependence on the number of reachable states, which can be as large as the state space size in the worst case and thus undermines the motivation for function approximation. Additionally, their method requires storing all historical data, and its time complexity scales linearly with the episode count, which is computationally expensive. In this work, we propose a statistically efficient algorithm that achieves a regret of $\widetilde{\mathcal{O}}(dH^2\sqrt{K} + \kappa^{-1}d^2H^2)$, eliminating the dependence on $\kappa^{-1}$ in the dominant term for the first time. We then address the computational challenges by introducing an enhanced algorithm that achieves the same regret guarantee but with only constant cost. Finally, we establish the first lower bound for this problem, justifying the optimality of our results in $d$ and $K$.
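To see why MNL function approximation "ensures valid probability distributions over the state space", here is a minimal sketch: transition probabilities are a softmax over feature–parameter inner products, so they are automatically non-negative and sum to one. The feature map `phi(s, a, s')` and parameter `theta` below are assumed names for illustration, not the paper's construction:

```python
import numpy as np

def mnl_transition_probs(phi, theta):
    """Softmax (multinomial logit) distribution over candidate next states.

    phi:   (S', d) array of features phi(s, a, s') for each reachable
           next state s' given the current state-action pair (s, a).
    theta: (d,) transition parameter vector.

    The softmax normalization guarantees a valid probability
    distribution over the S' candidate next states.
    """
    logits = phi @ theta
    logits = logits - logits.max()   # numerical stability
    weights = np.exp(logits)
    return weights / weights.sum()

# Toy example: 3 reachable next states, feature dimension d = 2.
rng = np.random.default_rng(1)
phi = rng.normal(size=(3, 2))
theta = np.array([0.5, -0.2])
p = mnl_transition_probs(phi, theta)
```

Note that the statistical price of this non-linearity is exactly the problem-dependent quantity $\kappa$: when the softmax concentrates most mass on few states, the model becomes hard to estimate, which is the dependence the paper's algorithms work to remove from the dominant regret term.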
Problem

Research questions and friction points this paper is trying to address.

Markov Decision Processes
Statistical Complexity
Computational Efficiency
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multinomial Logit Function Approximation
Markov Decision Process Optimization
Enhanced Computational Efficiency