AI Summary
Reinforcement learning with complex function-approximation classes faces two persistent challenges: balancing exploration and exploitation, and achieving high sample efficiency while keeping deployment (policy-switching) cost low.
Method: This paper proposes Monotonic Q-learning with Upper Confidence Bound (MQL-UCB), the first approach to jointly integrate monotonic Q-function structure, variance-weighted least-squares regression, and a deterministic policy-switching mechanism, analyzed rigorously within the eluder-dimension framework.
Contribution/Results: Theoretically, MQL-UCB achieves the first joint guarantee of near-optimal cumulative regret $\tilde{O}(d\sqrt{HK})$ and near-optimal policy-switch count $\tilde{O}(dH)$ for nonlinear function classes, resolving a long-standing open problem on simultaneously minimizing regret and switching cost. When $K$ is sufficiently large, the algorithm attains minimax-optimal performance, significantly improving both sample efficiency and practical deployability.
Abstract
The exploration-exploitation dilemma has been a central challenge in reinforcement learning (RL) with complex model classes. In this paper, we propose a new algorithm, Monotonic Q-Learning with Upper Confidence Bound (MQL-UCB), for RL with general function approximation. Our key algorithmic design includes (1) a general deterministic policy-switching strategy that achieves low switching cost, (2) a monotonic value function structure with carefully controlled function class complexity, and (3) a variance-weighted regression scheme that exploits historical trajectories with high data efficiency. MQL-UCB achieves minimax optimal regret of $\tilde{O}(d\sqrt{HK})$ when $K$ is sufficiently large and near-optimal policy switching cost of $\tilde{O}(dH)$, with $d$ being the eluder dimension of the function class, $H$ being the planning horizon, and $K$ being the number of episodes. Our work sheds light on designing provably sample-efficient and deployment-efficient Q-learning with nonlinear function approximation.
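To make ingredients (1) and (3) concrete, the following is a minimal sketch of variance-weighted least-squares regression combined with a deterministic, determinant-based switching rule, specialized to the linear case where such rules are standard. This is an illustrative assumption, not the paper's pseudocode: the feature model, the per-sample variance estimate `sigma2`, and the doubling threshold are all hypothetical choices made for the example.

```python
import numpy as np

# Hedged sketch (linear special case): variance-weighted regression with
# rare, deterministic policy switches. Names and constants are illustrative.
d, K = 4, 500
rng = np.random.default_rng(0)
theta_true = rng.normal(size=d)       # unknown parameter of the Q-target

Lambda = np.eye(d)                    # regularized, variance-weighted Gram matrix
b = np.zeros(d)                       # weighted feature-target sum
logdet_last = np.linalg.slogdet(Lambda)[1]  # log-determinant at the last switch
switches = 0

for k in range(K):
    phi = rng.normal(size=d)                  # feature of the visited (s, a)
    sigma2 = 1.0 + phi @ phi                  # assumed per-sample variance estimate
    y = phi @ theta_true + rng.normal() * np.sqrt(sigma2)

    # (3) Variance-weighted least squares: noisier samples get less weight.
    Lambda += np.outer(phi, phi) / sigma2
    b += phi * y / sigma2

    # (1) Deterministic policy switching: re-fit the estimate (i.e. update the
    # greedy policy) only when the Gram determinant has doubled since the last
    # switch, which can happen at most O(d log K) times over K episodes.
    logdet = np.linalg.slogdet(Lambda)[1]
    if logdet > logdet_last + np.log(2.0):
        theta_hat = np.linalg.solve(Lambda, b)  # weighted ridge solution
        logdet_last = logdet
        switches += 1

print(switches)  # far fewer policy updates than K episodes
```

The determinant-doubling test is what keeps the switching cost logarithmic in $K$ rather than linear: between switches the policy is frozen, yet the confidence region shrinks at essentially the same rate as if the policy were recomputed every episode.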