AI Summary
In Markov decision processes (MDPs), large state spaces induce prohibitive statistical and computational bottlenecks due to high data storage and processing overhead. Method: We propose a reinforcement learning framework based on multinomial logit (MNL) function approximation, integrating improved optimistic policy iteration, trust-region updates, and online gradient estimation. Contribution/Results: Our approach is the first to eliminate explicit dependence on the problem-specific parameter $\kappa^{-1}$ in the dominant term of the regret bound: we attain a regret upper bound of $\widetilde{\mathcal{O}}(dH^2\sqrt{K} + \kappa^{-1}d^2H^2)$, breaking the prior polynomial dependence of the leading term on $\kappa^{-1}$. Theoretically, we also establish the first lower bound for this setting, matching the upper bound in both the feature dimension $d$ and the number of episodes $K$, and achieve constant per-step computational complexity. This result preserves statistical optimality while substantially improving computational efficiency.
Abstract
We study a new class of MDPs that employs multinomial logit (MNL) function approximation to ensure valid probability distributions over the state space. Despite its significant benefits, incorporating the non-linear function raises substantial challenges in both statistical and computational efficiency. The best-known result of Hwang and Oh [2023] achieved an $\widetilde{\mathcal{O}}(\kappa^{-1}dH^2\sqrt{K})$ regret upper bound, where $\kappa$ is a problem-dependent quantity, $d$ is the feature dimension, $H$ is the episode length, and $K$ is the number of episodes. However, we observe that $\kappa^{-1}$ exhibits polynomial dependence on the number of reachable states, which can be as large as the state space size in the worst case and thus undermines the motivation for function approximation. Additionally, their method requires storing all historical data, and its time complexity scales linearly with the episode count, which is computationally expensive. In this work, we propose a statistically efficient algorithm that achieves a regret of $\widetilde{\mathcal{O}}(dH^2\sqrt{K} + \kappa^{-1}d^2H^2)$, eliminating the dependence on $\kappa^{-1}$ in the dominant term for the first time. We then address the computational challenges by introducing an enhanced algorithm that achieves the same regret guarantee but with only constant per-step computational cost. Finally, we establish the first lower bound for this problem, justifying the optimality of our results in $d$ and $K$.
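To make the modeling assumption concrete, here is a minimal sketch of how MNL function approximation parameterizes transition probabilities: a softmax over feature scores of the reachable next states, which is guaranteed to yield a valid probability distribution. The feature map `phi` and parameter `theta` below are illustrative assumptions, not the paper's exact construction.

```python
import numpy as np

def mnl_transition_probs(phi_next, theta):
    """MNL (softmax) transition model sketch.

    phi_next : (n_next, d) array, one feature vector per reachable
               next state s' for a fixed state-action pair (s, a).
    theta    : (d,) unknown parameter, estimated online in the paper.
    Returns a valid probability distribution over the next states.
    """
    logits = phi_next @ theta
    logits -= logits.max()          # subtract max for numerical stability
    weights = np.exp(logits)
    return weights / weights.sum()  # normalizes to sum to 1

# Toy example: d = 3 features, 4 reachable next states (hypothetical sizes).
rng = np.random.default_rng(0)
phi = rng.normal(size=(4, 3))
theta = rng.normal(size=3)
p = mnl_transition_probs(phi, theta)
print(p.sum())  # always 1 by construction
```

Because the softmax normalizes over only the reachable next states, the number of such states governs how flat the distribution can be, which is one intuition for why the problem-dependent quantity $\kappa^{-1}$ can grow with the reachable-state count.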