🤖 AI Summary
This work identifies the order of gradient updates as a critical factor in deep learning optimization dynamics. Building on this observation, we propose backward stochastic gradient descent (backward-SGD), which applies parameter updates in reverse mini-batch order. Theoretically, we prove that within contraction mapping regions, backward-SGD converges deterministically to a unique fixed point, whereas standard SGD converges only in distribution, yielding deterministic convergence and improved stability without learning rate scheduling or data augmentation. Empirically, under constant learning rates and in limited-data regimes, backward-SGD significantly improves training robustness and final test accuracy. Crucially, this study establishes update ordering as an independent, controllable optimization dimension, introducing "order modulation" as a lightweight, hyperparameter-free paradigm for enhancing training stability.
📝 Abstract
Despite its exceptional achievements, deep learning remains computationally expensive, and training is often plagued by instabilities that can degrade convergence. While learning rate schedules can help mitigate these issues, finding optimal schedules is time-consuming and resource-intensive. This work explores theoretical questions concerning training stability in the constant-learning-rate (i.e., schedule-free) and small-batch-size regime. Surprisingly, we show that the order of gradient updates affects the stability and convergence of gradient-based optimizers. We illustrate this new line of thinking with backward-SGD, which processes batch gradient updates like SGD but in reverse order. Our theoretical analysis shows that in contractive regions (e.g., around minima), backward-SGD converges to a point, while standard forward-SGD generally converges only to a distribution. This yields the improved stability and convergence that we demonstrate experimentally. While full backward-SGD is computationally intensive in practice, it highlights opportunities to exploit reverse training dynamics (or, more generally, alternative iteration orders) to improve training. To our knowledge, this represents a new and unexplored avenue in deep learning optimization.
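The point-versus-distribution distinction can be illustrated with a toy one-dimensional model. The following sketch is not the paper's implementation: it assumes each "mini-batch" defines an affine SGD update map (a contraction for a small enough learning rate), and compares the forward composition of those maps (standard SGD) with the reversed composition (backward-SGD). For a fixed random sequence of batches, the backward iterates stabilize at a sequence-dependent fixed point, while the forward iterates keep fluctuating.

```python
import random

# Toy setup (illustrative assumption, not the paper's code): batch i has a
# quadratic loss with gradient a_i * x - b_i, so one SGD step is the affine
# contraction f_i(x) = x - lr * (a_i * x - b_i) whenever |1 - lr * a_i| < 1.
random.seed(0)
lr = 0.1
batches = [(random.uniform(0.5, 1.5), random.uniform(-1.0, 1.0))
           for _ in range(50)]

def f(i, x):
    a, b = batches[i]
    return x - lr * (a * x - b)

def forward_sgd(x, order):
    # Standard SGD: apply the oldest batch's update first.
    for i in order:
        x = f(i, x)
    return x

def backward_sgd(x, order):
    # Reversed composition: the newest batch's update is applied first,
    # the oldest last (hence the cost: each iterate replays the sequence).
    for i in reversed(order):
        x = f(i, x)
    return x

# A long i.i.d. sequence of batch indices.
seq = [random.randrange(len(batches)) for _ in range(2000)]

# Backward iterates converge to a point: longer prefixes of the same
# sequence give (numerically) identical limits.
b1 = backward_sgd(0.0, seq[:1000])
b2 = backward_sgd(0.0, seq[:2000])

# Forward iterates only converge in distribution: later iterates still
# wander around the stationary distribution.
f1 = forward_sgd(0.0, seq[:1000])
f2 = forward_sgd(0.0, seq[:2000])

print("backward gap:", abs(b1 - b2))  # essentially zero
print("forward gap: ", abs(f1 - f2))
```

Each backward composition here replays the whole prefix from scratch, which mirrors the paper's remark that full backward-SGD is computationally intensive in practice.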