🤖 AI Summary
This work addresses a limitation of conventional data-parallel training: all GPU replicas use the same learning rate, so the space of learning rate configurations goes unexplored during training. The authors propose Hyperparameter-Divergent Ensemble Training (HDET), which repurposes the existing data-parallel replicas to explore a symmetric spread of learning rates, alternating between a fan-out phase of independent training and a converge phase that averages parameters via AllReduce every T steps, at negligible communication overhead. A momentum-based, gradient-free meta-controller adjusts the shared base learning rate schedule using the relative training loss across replicas. HDET is implemented as a drop-in replacement for PyTorch’s OneCycleLR scheduler and requires no additional hyperparameter sweeps or training budget, while improving both optimization quality and generalization. The method also extends to other scalar hyperparameters that do not alter the model architecture.
📝 Abstract
Training large neural networks with data-parallel stochastic gradient descent allocates N GPU replicas to compute effectively identical updates -- a practice that leaves the rich space of learning rate configurations entirely unexplored during training. We propose Hyperparameter-Divergent Ensemble Training (HDET), a method that repurposes these replicas for simultaneous learning rate exploration at negligible communication overhead. HDET operates in alternating phases: a fan-out stage in which replicas train independently under a structured, symmetric spread of learning rates, and a converge stage in which parameters are averaged across all replicas via AllReduce every T steps. Building on this ensemble substrate, we further propose an automatic learning rate (auto-LR) controller that treats the relative training loss across replicas as a performance signal, updating the shared base schedule toward higher-performing configurations via a momentum-based gradient-free meta-update. The combined method produces a self-adapting learning rate schedule that improves both optimization quality and generalization without additional hyperparameter sweeps or training budget.
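The fan-out/converge cycle described above can be sketched on a toy problem. This is a minimal single-process illustration, not the authors' implementation: the replicas are simulated in a loop, the plain average stands in for an AllReduce mean, the loss is a one-dimensional quadratic, and the geometric form of the spread (with the hypothetical `spread` parameter) is an assumption, since the abstract only says the spread is structured and symmetric.

```python
def symmetric_lr_spread(base_lr, n_replicas, spread=0.5):
    """Return one learning rate per replica, placed symmetrically (in log2
    space) around base_lr. The geometric form is an assumption."""
    if n_replicas == 1:
        return [base_lr]
    offsets = [-spread + 2.0 * spread * i / (n_replicas - 1)
               for i in range(n_replicas)]
    return [base_lr * 2.0 ** o for o in offsets]


def loss(w):
    # Toy quadratic objective used only for illustration.
    return 0.5 * w * w


def grad(w):
    return w


def hdet_round(w, lrs, T):
    """One fan-out/converge cycle.

    Fan-out: every simulated replica starts from the shared parameter w and
    takes T independent gradient steps with its own learning rate.
    Converge: the replica parameters are averaged (stand-in for AllReduce).
    Returns the averaged parameter and the per-replica final losses.
    """
    finals, losses = [], []
    for lr in lrs:
        w_r = w
        for _ in range(T):
            w_r -= lr * grad(w_r)
        finals.append(w_r)
        losses.append(loss(w_r))
    w_avg = sum(finals) / len(finals)
    return w_avg, losses


lrs = symmetric_lr_spread(base_lr=0.1, n_replicas=4, spread=1.0)
w_avg, losses = hdet_round(w=1.0, lrs=lrs, T=5)
```

The per-replica losses returned by `hdet_round` are the kind of signal that an auto-LR controller could consume. Parameters are synchronized only once every T steps, at the converge step.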
Crucially, the framework generalizes beyond learning rate: any scalar hyperparameter that does not alter model architecture -- such as dropout rate, attention scale temperature, or weight-decay coefficient -- can be explored across replicas using the same fan-out/converge protocol, with inter-replica loss differences serving as zero-order hypergradients that guide the search direction. HDET is implemented as a drop-in replacement for PyTorch's OneCycleLR scheduler, requiring no changes to model architecture, optimizer, or data pipeline.
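To illustrate how inter-replica loss differences can act as a zero-order hypergradient, the sketch below fits a least-squares slope of loss against each replica's hyperparameter offset and feeds it into a momentum update of the base value in log space. The class name `AutoScalarController`, the parameters `meta_lr` and `momentum`, the log-space parameterization, and the normalization by mean loss are all illustrative assumptions; the abstract does not specify the exact update rule.

```python
import math


def zero_order_hypergrad(offsets, losses):
    """Least-squares slope of loss versus hyperparameter offset.

    Uses only loss values (no backpropagation through the hyperparameter),
    so it is a zero-order estimate of d(loss)/d(offset).
    """
    n = len(offsets)
    mean_o = sum(offsets) / n
    mean_l = sum(losses) / n
    num = sum((o - mean_o) * (l - mean_l) for o, l in zip(offsets, losses))
    den = sum((o - mean_o) ** 2 for o in offsets)
    return num / den


class AutoScalarController:
    """Hypothetical momentum-based, gradient-free controller for one scalar
    hyperparameter (for example the base learning rate)."""

    def __init__(self, base_value, meta_lr=0.1, momentum=0.9):
        self.log_value = math.log(base_value)
        self.meta_lr = meta_lr
        self.momentum = momentum
        self.velocity = 0.0

    def update(self, offsets, losses):
        mean_loss = sum(losses) / len(losses)
        # Divide by the mean loss so the signal is relative, not absolute.
        g = zero_order_hypergrad(offsets, losses) / max(mean_loss, 1e-12)
        self.velocity = (self.momentum * self.velocity
                         + (1.0 - self.momentum) * g)
        # Move the base value toward the lower-loss side of the spread.
        self.log_value -= self.meta_lr * self.velocity
        return math.exp(self.log_value)


controller = AutoScalarController(base_value=0.1)
# Replicas with larger offsets reached lower loss, so the base value rises.
new_value = controller.update(offsets=[-1.0, 0.0, 1.0],
                              losses=[1.0, 0.8, 0.6])
```

Working in log space keeps the hyperparameter positive, which suits learning rates and weight-decay coefficients. A bounded hyperparameter such as dropout rate would need a different parameterization.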