Target Networks and Over-parameterization Stabilize Off-policy Bootstrapping with Function Approximation

📅 2024-05-31
🏛️ International Conference on Machine Learning
📈 Citations: 6 · Influential: 0
🤖 AI Summary
This work addresses the instability of value estimation in off-policy bootstrapping with function approximation, focusing on the combination of target networks and over-parameterized linear function approximators. Theoretically, we provide the first rigorous proof that their joint use guarantees convergence under specific off-policy settings, whereas either component alone does not, and we derive a high-probability upper bound on the value estimation error. Methodologically, we extend the framework to truncated trajectories, enabling stable learning for general tasks. Experiments on Baird's counterexample and the Four-room domain show that the combination substantially improves training stability and convergence robustness over baselines. Our results provide theoretical grounding for off-policy deep reinforcement learning, particularly DQN-style algorithms, and clarify the role of target networks in over-parameterized regimes.
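To make the mechanism concrete, here is a minimal sketch (not the authors' code) of off-policy linear TD(0) with a periodically frozen target network and over-parameterized features. The dimensions, step size, sync period, and uniform placeholder transitions are illustrative assumptions, loosely in the spirit of Baird's counterexample.

```python
# Minimal sketch, assuming a Baird-style setup: off-policy linear TD(0) with a
# periodically frozen target network and over-parameterized features (d > |S|).
# All constants and the uniform placeholder transitions are illustrative.
import numpy as np

rng = np.random.default_rng(0)
n_states, d = 7, 14                     # d > n_states: over-parameterized
Phi = rng.normal(size=(n_states, d))    # fixed random feature matrix
gamma, alpha, sync_every = 0.99, 0.05, 100

w = np.zeros(d)                         # online weights
w_target = w.copy()                     # frozen target weights

for t in range(10_000):
    s = rng.integers(n_states)          # off-policy: states drawn uniformly
    s_next = rng.integers(n_states)     # placeholder; substitute real dynamics
    r = 0.0                             # zero rewards, as in Baird's counterexample
    # Bootstrap from the frozen target weights, not the online ones.
    td_target = r + gamma * Phi[s_next] @ w_target
    w += alpha * (td_target - Phi[s] @ w) * Phi[s]
    if (t + 1) % sync_every == 0:
        w_target = w.copy()             # hard, DQN-style target refresh
```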

📝 Abstract
We prove that the combination of a target network and over-parameterized linear function approximation establishes a weaker convergence condition for bootstrapped value estimation in certain cases, even with off-policy data. Our condition is naturally satisfied for expected updates over the entire state-action space or learning with a batch of complete trajectories from episodic Markov decision processes. Notably, using only a target network or an over-parameterized model does not provide such a convergence guarantee. Additionally, we extend our results to learning with truncated trajectories, showing that convergence is achievable for all tasks with minor modifications, akin to value truncation for the final states in trajectories. Our primary result focuses on temporal difference estimation for prediction, providing high-probability value estimation error bounds and empirical analysis on Baird's counterexample and a Four-room task. Furthermore, we explore the control setting, demonstrating that similar convergence conditions apply to Q-learning.
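The truncated-trajectory modification ("value truncation for the final states") can be illustrated with a short sketch. This is a hedged reading, not the paper's exact algorithm: we assume truncation amounts to zeroing the bootstrap term at a trajectory's cut-off state, and the function name and signature here are hypothetical.

```python
# Hypothetical illustration (not the paper's code): one-step TD targets over a
# trajectory, with the bootstrap value truncated (zeroed) at the final state
# when the trajectory was cut off before true termination.
import numpy as np

def td_targets(rewards, next_features, w_target, gamma, truncated):
    """rewards: (T,); next_features: (T, d) successor-state features;
    w_target: (d,) frozen target-network weights; truncated: True if the
    trajectory was cut off early."""
    boot = next_features @ w_target      # bootstrap estimates from the target network
    if truncated:
        boot[-1] = 0.0                   # value truncation at the truncated final state
    return rewards + gamma * boot
```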
Problem

Research questions and friction points this paper is trying to address.

How can off-policy bootstrapping with function approximation be stabilized against divergence?
Under what conditions does bootstrapped temporal-difference value estimation converge on off-policy data?
Can convergence guarantees be extended to learning from truncated trajectories?
Innovation

Methods, ideas, or system contributions that make the work stand out.

Target networks stabilize off-policy bootstrapping with function approximation
Over-parameterized linear models enable convergence with target networks
Combined approach ensures convergence for truncated trajectory learning
Fengdi Che
University of Alberta
Artificial Intelligence
Chenjun Xiao
School of Data Science, The Chinese University of Hong Kong, Shenzhen
Jincheng Mei
Research Scientist, Google DeepMind
Machine Learning · Reinforcement Learning · Optimization
Bo Dai
Google DeepMind; School of Computational Science and Engineering, Georgia Tech
Ramki Gummadi
Google DeepMind
Oscar A Ramirez
Figure; the work was done while the author was at Google.
Christopher K Harris
Uber
A. R. Mahmood
CIFAR AI Chair, Amii; Department of Computing Science, University of Alberta
D. Schuurmans
Google DeepMind; Department of Computing Science, University of Alberta