🤖 AI Summary
In deep reinforcement learning, target networks stabilize training but slow learning due to their delayed updates, whereas bootstrapping directly from the online network is prone to divergence. To address this trade-off, we propose MINTO (Minimum Target), a lightweight, general-purpose method that uses the minimum of the target- and online-network estimates as the bootstrapping target during value updates. This mechanism preserves stability without sacrificing learning speed and requires no architectural or procedural changes, making it directly compatible with mainstream value-based (e.g., DQN) and actor-critic (e.g., DDPG) algorithms. Extensive experiments across diverse settings, including discrete and continuous action spaces and online and offline RL benchmarks, show that MINTO consistently accelerates convergence, improves final performance, and alleviates the tension between delayed target-network updates and online-network instability. MINTO thus offers a simple, general, and computationally efficient route to stable and accurate value estimation.
📝 Abstract
The use of target networks is a popular approach for estimating value functions in deep Reinforcement Learning (RL). While effective, the target network remains a compromise that preserves stability at the cost of slowly moving targets, thus delaying learning. Conversely, using the online network as a bootstrapped target is intuitively appealing, but well known to lead to unstable learning. In this work, we aim to get the best of both worlds by introducing a novel update rule that computes the target using the MINimum estimate between the Target and Online networks, giving rise to our method, MINTO. Through this simple yet effective modification, we show that MINTO enables faster and more stable value function learning by mitigating the potential overestimation bias that arises when bootstrapping from the online network. Notably, MINTO can be seamlessly integrated into a wide range of value-based and actor-critic algorithms at negligible cost. We evaluate MINTO extensively across diverse benchmarks, spanning online and offline RL, as well as discrete and continuous action spaces. Across all benchmarks, MINTO consistently improves performance, demonstrating its broad applicability and effectiveness.
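To make the update rule concrete, here is a minimal sketch of how a MINTO-style bootstrap target could be computed for a DQN-like update in PyTorch. The function name `minto_target` and its signature are illustrative rather than taken from the paper, and the abstract leaves open whether the minimum is applied per action before the greedy max or to the final bootstrapped value; this sketch assumes an element-wise minimum over the per-action Q-values.

```python
import torch
import torch.nn as nn

@torch.no_grad()
def minto_target(online_net: nn.Module,
                 target_net: nn.Module,
                 reward: torch.Tensor,    # shape (batch,)
                 next_obs: torch.Tensor,  # shape (batch, obs_dim)
                 done: torch.Tensor,      # shape (batch,), 1.0 if terminal
                 gamma: float = 0.99) -> torch.Tensor:
    """Hypothetical sketch of a MINTO bootstrap target for a DQN-style update."""
    q_online = online_net(next_obs)  # (batch, num_actions)
    q_target = target_net(next_obs)  # (batch, num_actions)
    # MINTO idea: bootstrap from the element-wise minimum of the online
    # and target estimates, curbing the overestimation that bootstrapping
    # from the online network alone would introduce.
    q_min = torch.minimum(q_online, q_target)
    next_value = q_min.max(dim=1).values  # greedy over actions
    return reward + gamma * (1.0 - done) * next_value
```

In an actor-critic setting such as DDPG, the same idea would presumably apply to the critic: evaluate both the online and target critics at the next state-action pair proposed by the policy and bootstrap from the smaller of the two values.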