A Nearly Optimal and Low-Switching Algorithm for Reinforcement Learning with General Function Approximation

📅 2023-11-26
🏛️ arXiv.org
📈 Citations: 11
✨ Influential: 2
🤖 AI Summary
In reinforcement learning with complex function approximation, balancing exploration and exploitation remains challenging, and simultaneously achieving high sample efficiency and low deployment cost is difficult. Method: This paper proposes Monotonic Q-Learning with Upper Confidence Bound (MQL-UCB), the first approach to jointly integrate a monotonic Q-function structure, variance-weighted least-squares regression, and a deterministic policy-switching mechanism, analyzed rigorously within the eluder-dimension framework. Contribution/Results: Theoretically, MQL-UCB achieves the first joint guarantee of near-optimal cumulative regret $\tilde{O}(d\sqrt{HK})$ and near-optimal policy-switch count $\tilde{O}(dH)$ for nonlinear function classes, resolving a long-standing open problem on simultaneously minimizing regret and switching cost. When $K$ is sufficiently large, the algorithm attains minimax-optimal regret, significantly improving both sample efficiency and practical deployability.
📝 Abstract
The exploration-exploitation dilemma has been a central challenge in reinforcement learning (RL) with complex model classes. In this paper, we propose a new algorithm, Monotonic Q-Learning with Upper Confidence Bound (MQL-UCB) for RL with general function approximation. Our key algorithmic design includes (1) a general deterministic policy-switching strategy that achieves low switching cost, (2) a monotonic value function structure with carefully controlled function class complexity, and (3) a variance-weighted regression scheme that exploits historical trajectories with high data efficiency. MQL-UCB achieves minimax optimal regret of $\tilde{O}(d\sqrt{HK})$ when $K$ is sufficiently large and near-optimal policy switching cost of $\tilde{O}(dH)$, with $d$ being the eluder dimension of the function class, $H$ being the planning horizon, and $K$ being the number of episodes. Our work sheds light on designing provably sample-efficient and deployment-efficient Q-learning with nonlinear function approximation.
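In the linear special case, a variance-weighted regression scheme of the kind the abstract describes reduces to weighted ridge regression, where each transition is down-weighted by an estimate of its conditional variance. The following is a minimal sketch under that linear assumption; the names `phis`, `targets`, and `sigma2` are illustrative, not the paper's notation:

```python
import numpy as np

def weighted_ridge(phis, targets, sigma2, lam=1.0):
    """Variance-weighted ridge regression (linear-feature sketch).

    Each sample k contributes with weight 1/sigma2[k], so low-noise
    transitions influence the estimate more than high-noise ones.
    """
    d = phis.shape[1]
    Lambda = lam * np.eye(d)          # regularized weighted Gram matrix
    b = np.zeros(d)
    for phi, y, s2 in zip(phis, targets, sigma2):
        Lambda += np.outer(phi, phi) / s2
        b += phi * y / s2
    theta = np.linalg.solve(Lambda, b)
    return theta, Lambda
```

The weighted Gram matrix `Lambda` is also what a UCB-style method would use to build an exploration bonus of the form $\beta\sqrt{\phi^\top \Lambda^{-1} \phi}$ on top of the point estimate.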
Problem

Research questions and friction points this paper is trying to address.

Achieving sample-efficient RL with general (nonlinear) function approximation
Minimizing policy-switching cost while retaining near-optimal regret
Designing provably sample- and deployment-efficient Q-learning with nonlinear approximations
Innovation

Methods, ideas, or system contributions that make the work stand out.

Low-switching policy strategy reduces deployment cost
Monotonic value structure controls function complexity
Variance-weighted regression improves data efficiency
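For intuition on the low-switching strategy: in the linear setting, a deterministic switching rule can be implemented by replanning only when the determinant of the feature Gram matrix has grown by a constant factor since the last switch. This determinant-doubling device is standard in the low-switching literature and is shown here as an assumption-laden sketch, not the paper's exact nonlinear criterion:

```python
import numpy as np

def should_switch(Lambda, Lambda_last, factor=2.0):
    # Replan only once enough new information has accumulated:
    # det(Lambda) > factor * det(Lambda_last).
    # slogdet avoids overflow for large d or long horizons.
    _, logdet = np.linalg.slogdet(Lambda)
    _, logdet_last = np.linalg.slogdet(Lambda_last)
    return bool(logdet > np.log(factor) + logdet_last)
```

Because the log-determinant of a $d$-dimensional Gram matrix can only increase by $\log 2$ a bounded number of times over $K$ episodes, this rule caps the number of policy switches at roughly $O(d \log K)$ regardless of how the data arrive.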