XQC: Well-conditioned Optimization Accelerates Deep Reinforcement Learning

📅 2025-09-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the low sample efficiency prevalent in deep reinforcement learning, this paper proposes an optimization-aware algorithm that specifically improves the optimization landscape of the critic network, reducing the condition number of its Hessian matrix to enhance training stability and accelerate convergence. Methodologically, the authors introduce, for the first time within the soft actor-critic framework, a joint combination of batch normalization, weight normalization, and a distributional cross-entropy loss, which naturally bounds gradient norms; network architecture design is further guided by spectral analysis of the Hessian. Evaluated on 55 proprioceptive and 15 vision-based continuous control tasks, the method achieves state-of-the-art sample efficiency while employing significantly fewer parameters than mainstream approaches, demonstrating a paradigm in which lightweight architectural design directly improves optimization quality.
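The two normalizations named in the summary can be illustrated with a minimal NumPy sketch. All function names, shapes, and the plain forward-pass formulation below are illustrative assumptions, not the paper's actual implementation:

```python
import numpy as np

def weight_norm_linear(x, v, g):
    # weight normalization: reparameterize each weight row as
    # w_i = g_i * v_i / ||v_i||, decoupling magnitude (g) from direction (v)
    w = g[:, None] * v / np.linalg.norm(v, axis=1, keepdims=True)
    return x @ w.T

def batch_norm(h, eps=1e-5):
    # batch normalization (training-mode sketch, no learned affine params):
    # zero-mean, unit-variance activations per feature over the batch
    return (h - h.mean(axis=0)) / np.sqrt(h.var(axis=0) + eps)

rng = np.random.default_rng(0)
x = rng.normal(size=(32, 8))   # hypothetical batch of 32 states, 8 features
v = rng.normal(size=(16, 8))   # unnormalized weight directions
g = np.ones(16)                # learned per-row magnitudes
h = batch_norm(weight_norm_linear(x, v, g))
```

Keeping weight magnitudes and activation statistics fixed in this way is one mechanism by which the combination can tame the scale of gradients during training.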

📝 Abstract
Sample efficiency is a central property of effective deep reinforcement learning algorithms. Recent work has improved this through added complexity, such as larger models, exotic network architectures, and more complex algorithms, which are typically motivated purely by empirical performance. We take a more principled approach by focusing on the optimization landscape of the critic network. Using the eigenspectrum and condition number of the critic's Hessian, we systematically investigate the impact of common architectural design decisions on training dynamics. Our analysis reveals that a novel combination of batch normalization (BN), weight normalization (WN), and a distributional cross-entropy (CE) loss produces condition numbers orders of magnitude smaller than baselines. This combination also naturally bounds gradient norms, a property critical for maintaining a stable effective learning rate under non-stationary targets and bootstrapping. Based on these insights, we introduce XQC: a well-motivated, sample-efficient deep actor-critic algorithm built upon soft actor-critic that embodies these optimization-aware principles. We achieve state-of-the-art sample efficiency across 55 proprioceptive and 15 vision-based continuous control tasks, all while using significantly fewer parameters than competing methods.
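The abstract's central diagnostic, the condition number of the critic's Hessian, is the ratio of the largest to smallest eigenvalue magnitude of the eigenspectrum. A generic NumPy sketch for a symmetric Hessian (not the paper's code):

```python
import numpy as np

def condition_number(hessian):
    # kappa = |lambda_max| / |lambda_min| for a symmetric matrix;
    # a large kappa indicates a poorly conditioned loss landscape where
    # gradient steps oscillate along steep directions and crawl along flat ones
    eigs = np.abs(np.linalg.eigvalsh(hessian))
    return eigs.max() / eigs.min()

# toy quadratic bowls: well-conditioned vs. ill-conditioned curvature
well = condition_number(np.diag([2.0, 1.0]))     # 2.0
ill = condition_number(np.diag([1000.0, 1.0]))   # 1000.0
```

The paper's claim is that BN + WN + CE shrinks this ratio by orders of magnitude relative to standard critic designs, which is what permits stable, fast optimization.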
Problem

Research questions and friction points this paper is trying to address.

Investigating optimization landscape issues in deep reinforcement learning critic networks
Addressing poor conditioning and gradient instability in actor-critic algorithms
Improving sample efficiency through better-conditioned optimization rather than added complexity
Innovation

Methods, ideas, or system contributions that make the work stand out.

Combining batch normalization and weight normalization improves the critic's optimization landscape
A distributional cross-entropy loss reduces the condition number of the critic's Hessian
A well-conditioned actor-critic algorithm (XQC) achieves state-of-the-art sample efficiency with fewer parameters
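The distributional cross-entropy ingredient above can be illustrated with a common two-hot categorical construction, in which a scalar TD target is projected onto a fixed support and the critic is trained with cross-entropy. This is a standard construction assumed here for illustration; the paper's exact parameterization may differ:

```python
import numpy as np

def two_hot(target, support):
    # project a scalar target onto the two support atoms that bracket it;
    # the expectation of the resulting distribution equals the clipped target
    target = float(np.clip(target, support[0], support[-1]))
    hi = int(np.clip(np.searchsorted(support, target), 1, len(support) - 1))
    lo = hi - 1
    w_hi = (target - support[lo]) / (support[hi] - support[lo])
    probs = np.zeros(len(support))
    probs[lo], probs[hi] = 1.0 - w_hi, w_hi
    return probs

def ce_loss(logits, target_probs):
    # cross-entropy against the critic's categorical prediction; its gradient
    # w.r.t. the logits is softmax(logits) - target_probs, so every component
    # lies in [-1, 1], which naturally bounds the gradient norm
    log_q = logits - logits.max()
    log_q -= np.log(np.exp(log_q).sum())
    return -(target_probs * log_q).sum()

support = np.linspace(-1.0, 1.0, 5)   # atoms of the return distribution
probs = two_hot(0.25, support)        # mass split between atoms 0.0 and 0.5
loss = ce_loss(np.zeros(5), probs)    # uniform prediction -> loss = log(5)
```

The bounded per-component gradient is the mechanism behind the "naturally bounds gradient norms" claim: unlike a squared TD error, the loss cannot produce arbitrarily large gradients under non-stationary bootstrapped targets.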
🔎 Similar Papers
No similar papers found.