From Information to Generative Exponent: Learning Rate Induces Phase Transitions in SGD

📅 2025-10-23

🤖 AI Summary

This paper studies how the learning rate determines the sample complexity of gradient-based learning of Gaussian single-index models. Method: The analysis covers a broad class of gradient-based algorithms that includes both correlational updates (one-pass SGD) and non-correlational updates (SGD with batch reuse). It also introduces a layer-wise training algorithm that uses a different learning rate for each layer (a two-timescales approach) and requires neither sample reuse nor a loss other than squared error. Contribution/Results: The paper characterizes the relationship between learning rate(s) and sample complexity and shows that, in certain cases, there is a phase transition from an "information exponent regime" at small learning rates to a "generative exponent regime" at large learning rates, where the generative exponent can be much smaller than the information exponent. The choice of learning rate is therefore as important as the design of the algorithm for statistical and computational efficiency.

📝 Abstract

To understand feature learning dynamics in neural networks, recent theoretical works have focused on gradient-based learning of Gaussian single-index models, where the label is a nonlinear function of a latent one-dimensional projection of the input. While the sample complexity of online SGD is determined by the information exponent of the link function, recent works improved on this by performing multiple gradient steps on the same sample with different learning rates -- yielding a non-correlational update rule -- so that the sample complexity is instead limited by the (potentially much smaller) generative exponent. However, this picture is only valid when these learning rates are sufficiently large. In this paper, we characterize the relationship between learning rate(s) and sample complexity for a broad class of gradient-based algorithms that encapsulates both correlational and non-correlational updates. We demonstrate that, in certain cases, there is a phase transition from an "information exponent regime" with small learning rate to a "generative exponent regime" with large learning rate. Our framework covers prior analyses of one-pass SGD and SGD with batch reuse, while also introducing a new layer-wise training algorithm that leverages a two-timescales approach (via different learning rates for each layer) to go beyond correlational queries without reusing samples or modifying the loss from squared error. Our theoretical study demonstrates that the choice of learning rate is as important as the design of the algorithm in achieving statistical and computational efficiency.
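
The abstract does not spell out the two exponents, so the following is a minimal sketch under the standard definition from the single-index literature: the information exponent of a link function is the smallest k >= 1 for which its k-th Hermite coefficient is non-zero. The helper names (`hermite_coefficient`, `information_exponent`), the example link He_3(z) = z^3 - 3z, and the choice of squaring the label are illustrative assumptions, not taken from the paper. Squaring is only one possible label transformation; it is used here to suggest why an exponent defined over transformed labels can be smaller than the information exponent.

```python
import math

import numpy as np
from numpy.polynomial.hermite_e import hermegauss, hermeval


def hermite_coefficient(f, k, n_nodes=64):
    """Return E[f(z) * He_k(z)] for z ~ N(0, 1) via Gauss-Hermite quadrature."""
    nodes, weights = hermegauss(n_nodes)  # quadrature for weight exp(-z^2 / 2)
    basis = np.zeros(k + 1)
    basis[k] = 1.0  # coefficient vector that selects He_k
    he_k = hermeval(nodes, basis)
    # Dividing by sqrt(2 * pi) turns the weighted sum into a Gaussian expectation.
    return float(np.sum(weights * f(nodes) * he_k) / math.sqrt(2.0 * math.pi))


def information_exponent(f, max_k=10, tol=1e-8):
    """Smallest k >= 1 whose Hermite coefficient exceeds tol, or None."""
    for k in range(1, max_k + 1):
        if abs(hermite_coefficient(f, k)) > tol:
            return k
    return None


def he3(z):
    """Illustrative link function: the third probabilists' Hermite polynomial."""
    return z**3 - 3.0 * z


# Exponent of the raw link, and of the link after squaring the label.
k_raw = information_exponent(he3)
k_squared = information_exponent(lambda z: he3(z) ** 2)
print(k_raw, k_squared)
```

For this link the raw exponent is 3, while the squared label already has a non-zero second Hermite coefficient, so its exponent drops to 2. The tolerance `tol` is a numerical convenience for deciding what counts as zero and would need adjusting for links that are not low-degree polynomials.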

Problem

Research questions and friction points this paper is trying to address.

Characterizes learning rate impact on SGD sample complexity

Identifies phase transitions between information and generative regimes

Proposes layer-wise training with two-timescale learning rates
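
To make the setting concrete, here is a minimal sketch of one-pass (online) SGD on a Gaussian single-index model, where each sample is used exactly once. It is not the paper's algorithm: the tanh link, the correlation loss, the spherical projection step, and all parameter values (`d`, `n_steps`, `lr`, `seed`) are illustrative assumptions. The function name `online_spherical_sgd` is hypothetical.

```python
import numpy as np


def online_spherical_sgd(d=20, n_steps=20000, lr=0.01, seed=0):
    """One-pass correlational SGD for y = tanh(theta . x), with x ~ N(0, I_d)."""
    rng = np.random.default_rng(seed)
    theta = rng.standard_normal(d)
    theta /= np.linalg.norm(theta)  # hidden unit-norm direction
    w = rng.standard_normal(d)
    w /= np.linalg.norm(w)  # random unit-norm initialization

    for _ in range(n_steps):
        x = rng.standard_normal(d)  # fresh sample at every step (no reuse)
        y = np.tanh(theta @ x)
        # Gradient of the correlation loss -y * tanh(w . x) with respect to w.
        grad = -y * (1.0 - np.tanh(w @ x) ** 2) * x
        grad -= (grad @ w) * w  # keep only the component tangent to the sphere
        w = w - lr * grad
        w /= np.linalg.norm(w)  # project back onto the unit sphere
    return w, theta


w, theta = online_spherical_sgd()
print("overlap with hidden direction:", float(w @ theta))
```

The tanh link has information exponent 1, so this is an easy case in which the overlap `w @ theta` grows quickly. Links with a higher information exponent start with a much weaker signal at random initialization, which is the regime where the paper's question about learning rate and sample complexity becomes relevant.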

Innovation

Methods, ideas, or system contributions that make the work stand out.

Learning rate induces phase transitions in SGD

Two-timescales approach uses layer-wise learning rates

Non-correlational updates reduce sample complexity requirements
