🤖 AI Summary
In deep reinforcement learning, target networks stabilize training but slow learning due to their delayed updates, whereas bootstrapping directly from the online network is prone to divergence. To address this trade-off, we propose MINTO (Minimum Target), a lightweight, general-purpose method that uses the minimum of the target- and online-network estimates as the bootstrapping target during value updates. This mechanism preserves stability without sacrificing learning speed and requires no architectural or procedural changes, making it directly compatible with mainstream value-based (e.g., DQN) and actor-critic (e.g., DDPG) algorithms. Extensive experiments across diverse settings, including discrete and continuous action spaces and online and offline RL benchmarks, show that MINTO consistently accelerates convergence, improves final performance, and alleviates the tension between delayed target-network updates and online-network instability. MINTO thus offers a simple, general, and computationally efficient route to stable and accurate value estimation.
📝 Abstract
The use of target networks is a popular approach for estimating value functions in deep Reinforcement Learning (RL). While effective, the target network remains a compromise that preserves stability at the cost of slowly moving targets, thus delaying learning. Conversely, using the online network as a bootstrapped target is intuitively appealing, but well known to lead to unstable learning. In this work, we aim to get the best of both worlds by introducing a novel update rule that computes the target using the MINimum estimate between the Target and Online networks, giving rise to our method, MINTO. Through this simple yet effective modification, we show that MINTO enables faster and more stable value function learning by mitigating the potential overestimation bias that arises when bootstrapping from the online network. Notably, MINTO can be seamlessly integrated into a wide range of value-based and actor-critic algorithms at negligible cost. We evaluate MINTO extensively across diverse benchmarks, spanning online and offline RL, as well as discrete and continuous action spaces. Across all benchmarks, MINTO consistently improves performance, demonstrating its broad applicability and effectiveness.
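To make the update rule concrete, here is a minimal sketch of how a MINTO-style bootstrap target could be computed for a DQN-like update in PyTorch. The function name `minto_target` and its signature are illustrative rather than taken from the paper, and the abstract leaves open whether the minimum is applied per action before the greedy max or to the final bootstrapped value; this sketch assumes an element-wise minimum over the per-action Q-values.

```python
import torch
import torch.nn as nn

@torch.no_grad()
def minto_target(online_net: nn.Module,
                 target_net: nn.Module,
                 reward: torch.Tensor,    # shape (batch,)
                 next_obs: torch.Tensor,  # shape (batch, obs_dim)
                 done: torch.Tensor,      # shape (batch,), 1.0 if terminal
                 gamma: float = 0.99) -> torch.Tensor:
    """Hypothetical sketch of a MINTO bootstrap target for a DQN-style update."""
    q_online = online_net(next_obs)  # (batch, num_actions)
    q_target = target_net(next_obs)  # (batch, num_actions)
    # MINTO idea: bootstrap from the element-wise minimum of the online
    # and target estimates, curbing the overestimation that bootstrapping
    # from the online network alone would introduce.
    q_min = torch.minimum(q_online, q_target)
    next_value = q_min.max(dim=1).values  # greedy over actions
    return reward + gamma * (1.0 - done) * next_value
```

In an actor-critic setting such as DDPG, the same idea would presumably apply to the critic: evaluate both the online and target critics at the next state-action pair proposed by the policy and bootstrap from the smaller of the two values.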