SPEQ: Stabilization Phases for Efficient Q-Learning in High Update-To-Data Ratio Reinforcement Learning

📅 2025-01-15

🤖 AI Summary
In deep reinforcement learning, high update-to-data (UTD) ratios improve sample efficiency but incur substantial computational overhead. To address this, we propose a phased training framework that alternates between an online, low-UTD training phase and an offline stabilization phase in which the Q-functions are fine-tuned without any additional environment interaction. Separating the two phases reduces computation while preserving sample efficiency, and the stabilization phase makes more effective use of the data already in the replay buffer. Built upon the DroQ (Dropout Q-Functions) algorithm, our method integrates twin Q-networks, soft target network updates, offline fine-tuning, and dynamic UTD scheduling. On continuous-control benchmarks, it achieves results comparable to state-of-the-art high-UTD methods while requiring 56% fewer gradient updates and 50% less training time than DroQ.

📝 Abstract
A key challenge in Deep Reinforcement Learning is sample efficiency, especially in real-world applications where collecting environment interactions is expensive or risky. Recent off-policy algorithms improve sample efficiency by increasing the Update-To-Data (UTD) ratio and performing more gradient updates per environment interaction. While this improves sample efficiency, it significantly increases computational cost due to the higher number of gradient updates required. In this paper we propose a sample-efficient method to improve computational efficiency by separating training into distinct learning phases in order to exploit gradient updates more effectively. Our approach builds on top of the Dropout Q-Functions (DroQ) algorithm and alternates between an online, low UTD ratio training phase, and an offline stabilization phase. During the stabilization phase, we fine-tune the Q-functions without collecting new environment interactions. This process improves the effectiveness of the replay buffer and reduces computational overhead. Our experimental results on continuous control problems show that our method achieves results comparable to state-of-the-art, high UTD ratio algorithms while requiring 56% fewer gradient updates and 50% less training time than DroQ. Our approach offers an effective and computationally economical solution while maintaining the same sample efficiency as the more costly, high UTD ratio state-of-the-art.

Problem

Research questions and friction points this paper is trying to address.

Improving computational efficiency in high UTD ratio reinforcement learning
Reducing gradient updates while maintaining sample efficiency
Balancing computational cost with performance in Q-learning algorithms
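
The cost trade-off listed above can be made concrete with a little arithmetic. The sketch below counts gradient updates for a fixed high-UTD schedule and for a phased schedule (low-UTD online training plus periodic offline stabilization). All function names and numeric settings are hypothetical and chosen only for illustration; they are not the paper's hyperparameters and are not meant to reproduce the reported 56% figure.

```python
def fixed_utd_updates(env_steps: int, utd: int) -> int:
    """Gradient updates when every environment step triggers `utd` updates."""
    return env_steps * utd


def phased_updates(env_steps: int, online_utd: int,
                   stabilize_every: int, stabilize_updates: int) -> int:
    """Gradient updates for low-UTD online training plus periodic offline phases.

    Assumption: one stabilization phase of `stabilize_updates` updates runs
    after every `stabilize_every` environment steps.
    """
    online = env_steps * online_utd
    n_phases = env_steps // stabilize_every
    return online + n_phases * stabilize_updates


def relative_saving(baseline: int, candidate: int) -> float:
    """Fraction of the baseline's gradient updates that the candidate avoids."""
    return 1.0 - candidate / baseline


# Illustrative settings only (assumed, not taken from the paper).
baseline = fixed_utd_updates(env_steps=100_000, utd=20)
phased = phased_updates(env_steps=100_000, online_utd=1,
                        stabilize_every=10_000, stabilize_updates=50_000)
print(f"baseline updates: {baseline}, phased updates: {phased}")
print(f"saving: {relative_saving(baseline, phased):.0%}")
```

With these assumed settings the phased schedule performs 600,000 updates against 2,000,000 for the fixed schedule. The environment-step budget is identical in both cases, which is why sample efficiency can in principle be held constant while compute drops.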

Innovation

Methods, ideas, or system contributions that make the work stand out.

Separates training into distinct online and offline phases
Fine-tunes Q-functions without new environment interactions
Reduces computational overhead while maintaining sample efficiency
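
The phase separation listed above can be sketched as a training loop. This is a minimal illustration under stated assumptions, not the authors' implementation: `ToyEnv`, `ToyAgent`, `train`, and every parameter name are hypothetical stand-ins, the agent only counts updates instead of computing gradients, and the choice to update only the Q-functions during stabilization follows the abstract's wording.

```python
import random


class ToyEnv:
    """Hypothetical stand-in environment that counts how often it is stepped."""

    def __init__(self):
        self.steps = 0

    def step(self, action):
        self.steps += 1
        # Return a dummy (state, action, reward) transition.
        return (random.random(), action, random.random())


class ToyAgent:
    """Hypothetical stand-in agent that only counts its updates."""

    def __init__(self):
        self.q_updates = 0
        self.policy_updates = 0

    def act(self):
        return random.random()

    def update_q(self, batch):
        self.q_updates += 1

    def update_policy(self, batch):
        self.policy_updates += 1


def train(env, agent, total_env_steps, online_utd,
          stabilize_every, stabilize_updates, batch_size=4):
    buffer = []
    for t in range(1, total_env_steps + 1):
        # Online phase: one environment interaction, then a small
        # number of gradient updates (low UTD ratio).
        buffer.append(env.step(agent.act()))
        for _ in range(online_utd):
            batch = random.choices(buffer, k=batch_size)
            agent.update_q(batch)
            agent.update_policy(batch)

        # Offline stabilization phase: fine-tune the Q-functions on the
        # existing replay buffer; the environment is never stepped here.
        if t % stabilize_every == 0:
            for _ in range(stabilize_updates):
                agent.update_q(random.choices(buffer, k=batch_size))
    return agent


env = ToyEnv()
agent = train(env, ToyAgent(), total_env_steps=100, online_utd=1,
              stabilize_every=25, stabilize_updates=10)
print(env.steps, agent.q_updates, agent.policy_updates)
```

In this toy run the environment is stepped exactly once per outer iteration, so the extra Q-function updates in the stabilization phase add compute but no new environment interactions.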