Scaling CrossQ with Weight Normalization

📅 2025-06-04
🤖 AI Summary
Reinforcement learning (RL) with high update-to-data (UTD) ratios frequently suffers from Q-value overestimation and growing critic network weight magnitudes, leading to training instability; these effects are particularly pronounced in the CrossQ framework. To address this, the authors integrate weight normalization into CrossQ, which curbs Q-bias growth and stabilizes weight dynamics without requiring network resets, while keeping the effective learning rate constant and preserving policy plasticity. Evaluated on the DeepMind Control Suite, including high-dimensional continuous-control tasks such as *dog* and *humanoid*, the approach improves sample efficiency and training stability, matching or exceeding existing methods. It introduces minimal computational overhead and is compatible with standard CrossQ implementations, offering a principled and practical solution to UTD-induced instability in value-based deep RL.

📝 Abstract
Reinforcement learning has achieved significant milestones, but sample efficiency remains a bottleneck for real-world applications. Recently, CrossQ has demonstrated state-of-the-art sample efficiency with a low update-to-data (UTD) ratio of 1. In this work, we explore CrossQ's scaling behavior with higher UTD ratios. We identify challenges in the training dynamics which are emphasized by higher UTDs, particularly Q-bias explosion and the growing magnitude of critic network weights. To address this, we integrate weight normalization into the CrossQ framework, a solution that stabilizes training, prevents potential loss of plasticity and keeps the effective learning rate constant. Our proposed approach reliably scales with increasing UTD ratios, achieving competitive or superior performance across a range of challenging tasks on the DeepMind control benchmark, notably the complex dog and humanoid environments. This work eliminates the need for drastic interventions, such as network resets, and offers a robust pathway for improving sample efficiency and scalability in model-free reinforcement learning.
Problem

Research questions and friction points this paper is trying to address.

Explores CrossQ's scaling with higher update-to-data ratios
Addresses Q-bias explosion and critic weight magnitude issues
Integrates weight normalization to stabilize training dynamics
Innovation

Methods, ideas, or system contributions that make the work stand out.

Integrates weight normalization into CrossQ
Stabilizes training by keeping the effective learning rate constant
Scales effectively with higher UTD ratios
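The core mechanism can be illustrated with a minimal sketch. Weight normalization reparameterizes each weight matrix as w = g · v / ‖v‖, decoupling the direction (v) from the magnitude (g). The class below is a hypothetical, simplified illustration (not the paper's implementation): it shows why unbounded growth of the raw critic weights v, as observed under high UTD ratios, leaves the layer's output and effective learning rate unaffected, since only g sets the weight magnitude.

```python
import numpy as np

class WeightNormLinear:
    """Minimal weight-normalized linear layer (illustrative sketch).

    Weights are reparameterized as w = g * v / ||v||, so the direction (v)
    and magnitude (g) of each output unit's weights are learned separately.
    """

    def __init__(self, in_dim, out_dim, rng):
        self.v = rng.standard_normal((out_dim, in_dim))  # direction parameters
        self.g = np.ones(out_dim)                        # per-unit magnitude
        self.b = np.zeros(out_dim)                       # bias

    def __call__(self, x):
        # Normalize each row of v to unit norm, then rescale by g.
        norm = np.linalg.norm(self.v, axis=1, keepdims=True)
        w = self.g[:, None] * self.v / norm
        return x @ w.T + self.b

rng = np.random.default_rng(0)
layer = WeightNormLinear(4, 2, rng)
x = rng.standard_normal((3, 4))
y1 = layer(x)

# Inflating v (as unbounded critic weights would under high UTD)
# does not change the output: the normalization cancels the growth,
# and only g controls the effective weight magnitude.
layer.v *= 10.0
y2 = layer(x)
assert np.allclose(y1, y2)
```

Because gradients with respect to v shrink as ‖v‖ grows, constraining the magnitude to g keeps the effective learning rate stable, which is the property the paper exploits to avoid network resets.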
Daniel Palenicek
PhD student at Technische Universität Darmstadt
Reinforcement Learning, Machine Learning, Artificial Intelligence
Florian Vogt
Master Student
Deep Reinforcement Learning
Jan Peters
Technical University of Darmstadt, hessian.AI, German Research Center for AI (DFKI), Robotics Institute Germany (RIG)