AI Summary
This work addresses the instability of temporal difference (TD) learning in offline reinforcement learning, where error amplification often leads to Q-value collapse. Viewing offline TD updates through the lens of control theory, the study models the learning process as a feedback system and reveals, for the first time, that the dynamic properties of the Adam optimizer can directly induce or suppress such collapse. To mitigate error propagation, the authors propose AdamO, a decoupled orthogonal correction mechanism incorporating a task-aligned budget constraint. Theoretical guarantees on worst-case task safety are established via spectral radius stability analysis and continuous-time dissipative dynamics. Empirically, AdamO significantly enhances both stability and performance across diverse offline RL benchmarks while maintaining broad compatibility with existing algorithms.
Abstract
Offline reinforcement learning (RL) can fail spectacularly when bootstrapped temporal-difference (TD) updates amplify their own errors, driving the critic toward extreme and unusable Q-values. A key counterintuitive insight of this work is that collapse is not only a property of the backup rule or network architecture: optimizer dynamics themselves can directly trigger or suppress instability. From a control-theoretic viewpoint, we model offline TD learning as a feedback system and analyze Adam-based critic updates. This yields a necessary and sufficient condition for stability of the induced local update dynamics: within the regime we analyze, these dynamics are stable if and only if the spectral radius of the corresponding update operator is strictly below one. Further analysis suggests that standard Adam updates can inadvertently distort the parameter geometry, motivating explicit orthogonality constraints to prevent TD error amplification. To this end, we propose AdamO, an Adam-based optimizer with a decoupled orthogonality correction regulated by a strict task-alignment budget. We prove that this design theoretically guarantees worst-case task safety and preserves Adam's continuous-time dissipative dynamics. Empirically, AdamO is broadly compatible with diverse offline RL baselines, improving stability and returns across a broad suite of benchmarks.
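
The spectral-radius condition stated above can be illustrated with a minimal sketch. It assumes the local update dynamics reduce to a linear iteration e_{k+1} = M e_k on the error vector e. The two matrices and the helper names (`spectral_radius`, `is_stable`, `rollout_norm`) are hypothetical choices made for illustration; they are not taken from the paper and do not represent the Adam-induced update operator it analyzes.

```python
import numpy as np


def spectral_radius(M: np.ndarray) -> float:
    """Return the largest eigenvalue magnitude of a square matrix."""
    return float(np.max(np.abs(np.linalg.eigvals(M))))


def is_stable(M: np.ndarray) -> bool:
    """A linear iteration e_{k+1} = M e_k is asymptotically stable
    if and only if the spectral radius of M is strictly below one."""
    return spectral_radius(M) < 1.0


def rollout_norm(M: np.ndarray, steps: int = 200, seed: int = 0) -> float:
    """Iterate e_{k+1} = M e_k from a random start and return the final norm."""
    rng = np.random.default_rng(seed)
    e = rng.standard_normal(M.shape[0])
    for _ in range(steps):
        e = M @ e
    return float(np.linalg.norm(e))


# Hypothetical update operators (illustrative only).
# Upper-triangular: eigenvalues are the diagonal entries 0.9 and 0.8.
M_stable = np.array([[0.9, 0.5],
                     [0.0, 0.8]])
# Lower-triangular: eigenvalues are 1.05 and 0.7, so one mode grows.
M_unstable = np.array([[1.05, 0.0],
                       [0.30, 0.7]])

for name, M in [("stable", M_stable), ("unstable", M_unstable)]:
    print(name, spectral_radius(M), is_stable(M), rollout_norm(M))
```

In this toy setting the error decays toward zero when the spectral radius is below one and grows without bound when it exceeds one, which is the qualitative behavior associated with Q-value collapse in the text. The condition is asymptotic: a non-normal matrix such as `M_stable` can still show transient growth before decaying.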