AI Summary
Reinforcement learning with complex function-approximation classes faces two persistent challenges: balancing exploration and exploitation, and achieving high sample efficiency while keeping deployment (policy-switching) cost low.
Method: This paper proposes Monotonic Q-learning with Upper Confidence Bound (MQL-UCB), the first approach to jointly integrate monotonic Q-function structure, variance-weighted least-squares regression, and a deterministic policy-switching mechanism, analyzed rigorously within the eluder-dimension framework.
Contribution/Results: Theoretically, MQL-UCB achieves the first joint guarantee of near-optimal cumulative regret $\tilde{O}(d\sqrt{HK})$ and near-optimal policy-switch count $\tilde{O}(dH)$ for nonlinear function classes, resolving a long-standing open problem on simultaneously minimizing regret and switching cost. When $K$ is sufficiently large, the algorithm attains minimax-optimal performance, significantly improving both sample efficiency and practical deployability.
Abstract
The exploration-exploitation dilemma has been a central challenge in reinforcement learning (RL) with complex model classes. In this paper, we propose a new algorithm, Monotonic Q-Learning with Upper Confidence Bound (MQL-UCB), for RL with general function approximation. Our key algorithmic design includes (1) a general deterministic policy-switching strategy that achieves low switching cost, (2) a monotonic value function structure with carefully controlled function class complexity, and (3) a variance-weighted regression scheme that exploits historical trajectories with high data efficiency. MQL-UCB achieves minimax optimal regret of $\tilde{O}(d\sqrt{HK})$ when $K$ is sufficiently large and near-optimal policy switching cost of $\tilde{O}(dH)$, with $d$ being the eluder dimension of the function class, $H$ being the planning horizon, and $K$ being the number of episodes. Our work sheds light on designing provably sample-efficient and deployment-efficient Q-learning with nonlinear function approximation.
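To make ingredients (1) and (3) concrete, the following is a minimal sketch of variance-weighted least-squares regression combined with a deterministic, determinant-based switching rule, specialized to the linear case where such rules are standard. This is an illustrative assumption, not the paper's pseudocode: the feature model, the per-sample variance estimate `sigma2`, and the doubling threshold are all hypothetical choices made for the example.

```python
import numpy as np

# Hedged sketch (linear special case): variance-weighted regression with
# rare, deterministic policy switches. Names and constants are illustrative.
d, K = 4, 500
rng = np.random.default_rng(0)
theta_true = rng.normal(size=d)       # unknown parameter of the Q-target

Lambda = np.eye(d)                    # regularized, variance-weighted Gram matrix
b = np.zeros(d)                       # weighted feature-target sum
logdet_last = np.linalg.slogdet(Lambda)[1]  # log-determinant at the last switch
switches = 0

for k in range(K):
    phi = rng.normal(size=d)                  # feature of the visited (s, a)
    sigma2 = 1.0 + phi @ phi                  # assumed per-sample variance estimate
    y = phi @ theta_true + rng.normal() * np.sqrt(sigma2)

    # (3) Variance-weighted least squares: noisier samples get less weight.
    Lambda += np.outer(phi, phi) / sigma2
    b += phi * y / sigma2

    # (1) Deterministic policy switching: re-fit the estimate (i.e. update the
    # greedy policy) only when the Gram determinant has doubled since the last
    # switch, which can happen at most O(d log K) times over K episodes.
    logdet = np.linalg.slogdet(Lambda)[1]
    if logdet > logdet_last + np.log(2.0):
        theta_hat = np.linalg.solve(Lambda, b)  # weighted ridge solution
        logdet_last = logdet
        switches += 1

print(switches)  # far fewer policy updates than K episodes
```

The determinant-doubling test is what keeps the switching cost logarithmic in $K$ rather than linear: between switches the policy is frozen, yet the confidence region shrinks at essentially the same rate as if the policy were recomputed every episode.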