AI Summary
This work addresses the limited understanding of training dynamics in deep neural networks with ReLU activations, particularly regarding how activation patterns evolve during optimization. The study proposes that training unfolds over two distinct time scales: an initial phase characterized by rapid changes in activation patterns, followed by a later phase where weights are fine-tuned within stable activation regions. Leveraging a geometric perspective, the authors develop a theoretical framework for activation pattern stability, supported by measure-theoretic analysis of local stability. They empirically track activation and weight trajectories across fully connected, convolutional, and Transformer architectures, revealing that activation patterns stabilize approximately three times earlier than weight updates converge. This consistent observation (activations converge first, weights fine-tune later) provides a foundational insight for staged optimization strategies in deep learning.
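To make the geometric picture concrete, one way to formalize the activation pattern and the local stability statement (notation ours, sketched from the abstract rather than the paper's formal definitions):

```latex
% Activation pattern of input x under parameters theta:
% one binary indicator per hidden ReLU unit, where z_i is the pre-activation of unit i.
\[
  \sigma_i(x;\theta) \;=\; \mathbb{1}\!\left[\, z_i(x;\theta) > 0 \,\right],
  \qquad
  \sigma(x;\theta) \in \{0,1\}^{N}.
\]
% On the region R_sigma = { x : sigma(x;theta) = sigma } the network acts as an
% affine map x -> W_sigma x + b_sigma. Local stability: outside measure-zero sets
% of (theta, x), there exists epsilon > 0 such that
\[
  \|\theta' - \theta\| < \varepsilon
  \;\Longrightarrow\;
  \sigma(x;\theta') \;=\; \sigma(x;\theta).
\]
```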
Abstract
Despite the empirical success of deep neural networks (DNNs), their internal training dynamics remain difficult to characterize. In ReLU-based models, the activation pattern induced by a given input determines the piecewise-linear region in which the network behaves affinely. Motivated by this geometry, we investigate whether training exhibits a two-timescale behavior: an early stage with substantial changes in activation patterns and a later stage where weight updates predominantly refine the model within largely stable activation regimes. We first prove a local stability property: outside measure-zero sets of parameters and inputs, sufficiently small parameter perturbations preserve the activation pattern of a fixed input, implying locally affine behavior within activation regions. We then empirically track per-iteration changes in weights and activation patterns on fixed validation subsets, across fully connected and convolutional architectures as well as Transformer-based models, where activation patterns are recorded in the ReLU feed-forward (MLP/FFN) submodules. Across the evaluated settings, activation-pattern changes decay roughly three times earlier than weight-update magnitudes, showing that late-stage training often proceeds within relatively stable activation regimes. These findings provide a concrete, architecture-agnostic instrument for monitoring training dynamics and motivate further study of decoupled optimization strategies for piecewise-linear networks. For reproducibility, code and experiment configurations will be released upon acceptance.
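As an illustration of the tracking procedure, the following is a minimal sketch (PyTorch assumed; helper names such as `capture_relu_patterns` and `flip_fraction` are ours, not the authors' released code) that records ReLU activation patterns on a fixed validation batch and reports, per iteration, the fraction of flipped activations alongside the weight-update norm:

```python
# Minimal sketch: track activation-pattern changes and weight-update magnitudes
# per training iteration on a fixed validation batch. Illustrative only.
import torch
import torch.nn as nn

def capture_relu_patterns(model: nn.Module, x: torch.Tensor) -> torch.Tensor:
    """Concatenate the binary on/off pattern of every ReLU unit for inputs x."""
    bits = []
    hooks = [
        m.register_forward_hook(lambda _m, _inp, out: bits.append((out > 0).flatten(1)))
        for m in model.modules() if isinstance(m, nn.ReLU)
    ]
    with torch.no_grad():
        model(x)
    for h in hooks:
        h.remove()
    return torch.cat(bits, dim=1)  # shape: (batch, total number of ReLU units)

def flip_fraction(prev_bits: torch.Tensor, curr_bits: torch.Tensor) -> float:
    """Fraction of (input, unit) pairs whose activation state changed."""
    return (prev_bits != curr_bits).float().mean().item()

def weight_update_norm(prev_params, curr_params) -> float:
    """L2 norm of the concatenated parameter update between two snapshots."""
    return torch.sqrt(sum(((c - p) ** 2).sum() for p, c in zip(prev_params, curr_params))).item()

# Toy usage: a small fully connected ReLU network and a fixed validation subset.
model = nn.Sequential(nn.Linear(10, 64), nn.ReLU(), nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 1))
opt = torch.optim.SGD(model.parameters(), lr=1e-2)
x_val = torch.randn(128, 10)                              # fixed validation subset
x_train, y_train = torch.randn(512, 10), torch.randn(512, 1)

prev_bits = capture_relu_patterns(model, x_val)
prev_params = [p.detach().clone() for p in model.parameters()]
for step in range(200):
    opt.zero_grad()
    loss = nn.functional.mse_loss(model(x_train), y_train)
    loss.backward()
    opt.step()

    curr_bits = capture_relu_patterns(model, x_val)
    curr_params = [p.detach().clone() for p in model.parameters()]
    print(step, flip_fraction(prev_bits, curr_bits), weight_update_norm(prev_params, curr_params))
    prev_bits, prev_params = curr_bits, curr_params
```

Under the paper's claim, the printed flip fraction would be expected to decay to near zero well before the weight-update norm does; the two-timescale picture corresponds to the gap between those two decay points.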