A Bit of Freedom Goes a Long Way: Classical and Quantum Algorithms for Reinforcement Learning under a Generative Model

📅 2025-07-30
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work studies online learning in finite-horizon and infinite-horizon average-reward Markov decision processes (MDPs) within a hybrid exploration-generative model, in which the agent can occasionally query a simulator ("generative model") for samples. By computing near-optimal policies directly from these samples, the algorithms sidestep exploration paradigms such as "optimism in the face of uncertainty" and posterior sampling. For finite-horizon MDPs, the quantum algorithms attain regret whose dependence on the number of time steps T is only logarithmic, breaking the classical Ω(√T) lower bound; this matches the time dependence of the prior quantum works of Ganguly et al. and Zhong et al. while improving the dependence on the number of states S and actions A. For infinite-horizon MDPs, the classical and quantum bounds retain the O(√T) dependence but with better S and A factors, and a new regret measure is introduced under which the quantum algorithms achieve polylog(T) regret, an exponential improvement over classical algorithms. Methodologically, the approach combines generative-model interaction, quantum amplitude estimation, and function approximation, and all results extend to compact state spaces.
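For context, the regret notion referenced above can be stated in the standard finite-horizon episodic form. The display below is the conventional definition together with a summary of the T-dependence; it is background notation, not the paper's exact statement, and it suppresses the dependence on S, A, and the horizon H.

```latex
% Standard episodic regret over K episodes of horizon H (so T = KH):
% V_1^{*} is the optimal value function and \pi_k the policy executed in episode k.
\mathrm{Regret}(T) \;=\; \sum_{k=1}^{K} \Bigl( V_1^{*}\bigl(s_1^{k}\bigr) - V_1^{\pi_k}\bigl(s_1^{k}\bigr) \Bigr),
\qquad
\underbrace{\Omega\bigl(\sqrt{T}\bigr)}_{\text{classical lower bound}}
\;\;\text{vs.}\;\;
\underbrace{O\bigl(\operatorname{poly}\log T\bigr)}_{\text{quantum upper bound (finite horizon)}}.
```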

📝 Abstract
We propose novel classical and quantum online algorithms for learning finite-horizon and infinite-horizon average-reward Markov Decision Processes (MDPs). Our algorithms are based on a hybrid exploration-generative reinforcement learning (RL) model wherein the agent can, from time to time, freely interact with the environment in a generative sampling fashion, i.e., by having access to a "simulator". By employing known classical and new quantum algorithms for approximating optimal policies under a generative model within our learning algorithms, we show that it is possible to avoid several paradigms from RL like "optimism in the face of uncertainty" and "posterior sampling" and instead compute and use optimal policies directly, which yields better regret bounds compared to previous works. For finite-horizon MDPs, our quantum algorithms obtain regret bounds which only depend logarithmically on the number of time steps $T$, thus breaking the $O(\sqrt{T})$ classical barrier. This matches the time dependence of the prior quantum works of Ganguly et al. (arXiv'23) and Zhong et al. (ICML'24), but with improved dependence on other parameters like state space size $S$ and action space size $A$. For infinite-horizon MDPs, our classical and quantum bounds still maintain the $O(\sqrt{T})$ dependence but with better $S$ and $A$ factors. Nonetheless, we propose a novel measure of regret for infinite-horizon MDPs with respect to which our quantum algorithms have $\operatorname{poly}\log{T}$ regret, exponentially better compared to classical algorithms. Finally, we generalise all of our results to compact state spaces.
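To make the hybrid exploration-generative model concrete, here is a minimal classical sketch for a finite-horizon tabular MDP: the agent spends its free generative queries to build an empirical model, runs value iteration on that model to obtain a near-optimal policy, and then executes the policy online. All names (simulator, env_reset, env_step, n_samples) are illustrative assumptions, not the paper's interface; per the summary above, the quantum algorithms replace the plain Monte Carlo sampling step with amplitude-estimation-based estimation for a better query-accuracy trade-off.

```python
import numpy as np

def estimate_model(simulator, S, A, n_samples):
    """Build an empirical MDP by querying the generative model (simulator)
    n_samples times for every state-action pair."""
    P_hat = np.zeros((S, A, S))
    R_hat = np.zeros((S, A))
    for s in range(S):
        for a in range(A):
            for _ in range(n_samples):
                s_next, r = simulator(s, a)   # generative access: draw (s', r) given (s, a)
                P_hat[s, a, s_next] += 1.0
                R_hat[s, a] += r
            P_hat[s, a] /= n_samples
            R_hat[s, a] /= n_samples
    return P_hat, R_hat

def value_iteration(P_hat, R_hat, H):
    """Finite-horizon value iteration on the empirical model.
    Returns a time-dependent greedy policy pi[h, s]."""
    S, A = R_hat.shape
    V = np.zeros(S)
    pi = np.zeros((H, S), dtype=int)
    for h in reversed(range(H)):
        Q = R_hat + P_hat @ V             # Q[s, a] = R[s, a] + sum_{s'} P[s, a, s'] * V[s']
        pi[h] = Q.argmax(axis=1)
        V = Q.max(axis=1)
    return pi

def hybrid_episode(env_reset, env_step, simulator, S, A, H, n_samples):
    """One episode of the hybrid scheme: free generative queries first,
    then online execution of the computed policy."""
    P_hat, R_hat = estimate_model(simulator, S, A, n_samples)
    pi = value_iteration(P_hat, R_hat, H)
    s, total_reward = env_reset(), 0.0
    for h in range(H):
        s, r = env_step(s, pi[h, s])
        total_reward += r
    return total_reward
```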
Problem

Research questions and friction points this paper is trying to address.

Develop classical and quantum online algorithms for RL under a hybrid exploration-generative (simulator-access) model
Improve regret bounds for finite-horizon and infinite-horizon average-reward MDPs
Achieve polylogarithmic-in-T quantum regret, surpassing the classical Ω(√T) barrier
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hybrid exploration-generative RL model combining online interaction with occasional simulator access
Quantum algorithms with polylogarithmic-in-T regret via amplitude estimation (see the note after this list)
Improved dependence on state-space size S and action-space size A in both classical and quantum regret bounds
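At a high level, the quantum gain in the T-dependence comes from the quadratic query savings of quantum mean estimation. The comparison below states the standard amplitude-estimation guarantee (Brassard et al. 2002; Montanaro 2015) for context; it is background, not a lemma taken from the paper.

```latex
% Cost of estimating a bounded mean \mu \in [0,1] to additive error \epsilon:
\underbrace{O\!\left(1/\epsilon^{2}\right)}_{\text{classical Monte Carlo samples}}
\quad\longrightarrow\quad
\underbrace{O\!\left(1/\epsilon\right)}_{\text{quantum amplitude-estimation queries}}
% Equivalently, a query budget of N yields accuracy O(1/\sqrt{N}) classically but O(1/N)
% quantumly, which is what makes regret scaling polylogarithmic in T possible.
```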