Understanding the Mechanisms of Fast Hyperparameter Transfer

📅 2025-12-27
🤖 AI Summary
Scaling up deep learning models drastically increases hyperparameter tuning costs. Method: We study "fast hyperparameter transfer": efficiently transferring optimal hyperparameters from small-scale grid searches to large-scale models. We formally define fast transfer and establish an equivalence criterion for it, and we introduce the optimization-trajectory decomposition hypothesis, which reveals the structural roles of width-stable versus width-sensitive loss components under the $μ$P framework and thereby characterizes the essential conditions for successful transfer. Results: Through asymptotic modeling, computational complexity analysis, synthetic experiments, and empirical LLM pretraining (including $μ$P), we show that fast transfer is equivalent to transferring computational advantage, and that problem structure, not scale, determines transfer success. Core contribution: a verifiable theoretical foundation and practical guidelines for cross-scale hyperparameter transfer.

📝 Abstract
The growing scale of deep learning models has rendered standard hyperparameter (HP) optimization prohibitively expensive. A promising solution is the use of scale-aware hyperparameters, which can enable direct transfer of optimal HPs from small-scale grid searches to large models with minimal performance loss. To understand the principles governing such a transfer strategy, we develop a general conceptual framework for reasoning about HP transfer across scale, characterizing transfer as fast when the suboptimality it induces vanishes asymptotically faster than the finite-scale performance gap. We show formally that fast transfer is equivalent to useful transfer for compute-optimal grid search, meaning that transfer is asymptotically more compute-efficient than direct tuning. While empirical work has found that the Maximal Update Parameterization ($μ$P) exhibits fast transfer when scaling model width, the mechanisms remain poorly understood. We show that this property depends critically on problem structure by presenting synthetic settings where transfer either offers provable computational advantage or fails to outperform direct tuning even under $μ$P. To explain the fast transfer observed in practice, we conjecture that decomposing the optimization trajectory reveals two contributions to loss reduction: (1) a width-stable component that determines the optimal HPs, and (2) a width-sensitive component that improves with width but weakly perturbs the HP optimum. We present empirical evidence for this hypothesis across various settings, including large language model pretraining.
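The abstract's "suboptimality vanishes faster than the performance gap" criterion can be sketched in notation (illustrative only; the symbols below are not the paper's exact formalism). Let $L_n(h)$ be the loss at scale $n$ under hyperparameters $h$, let $h_n^\star = \arg\min_h L_n(h)$ be the scale-$n$ optimum, and let $h_m^\star$ be the optimum tuned at a smaller proxy scale $m$. Transfer is fast when

$$\frac{L_n(h_m^\star) - L_n(h_n^\star)}{L_n(h_n^\star) - L_\infty} \;\to\; 0 \quad \text{as } n \to \infty,$$

that is, the loss incurred by reusing the small-scale optimum $h_m^\star$ becomes asymptotically negligible relative to the finite-scale performance gap $L_n(h_n^\star) - L_\infty$ that remains even with perfectly tuned hyperparameters.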
Problem

Research questions and friction points this paper is trying to address.

Develop a framework for fast hyperparameter transfer across model scales
Identify conditions where hyperparameter transfer outperforms direct tuning
Explain mechanisms of fast transfer in neural network optimization
Innovation

Methods, ideas, or system contributions that make the work stand out.

Scale-aware hyperparameters enable direct transfer from small to large models
Fast transfer defined as suboptimality vanishing faster than performance gap
Decomposing optimization into width-stable and width-sensitive loss components
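The scale-aware transfer idea in the bullets above can be illustrated with a minimal sketch, assuming the commonly cited $μ$P width-scaling rule that the learning rate of hidden (matrix-like) parameters is rescaled by `base_width / width`; the function name, widths, and learning-rate value here are hypothetical choices, not the paper's setup.

```python
# Sketch of width-wise hyperparameter transfer in the spirit of muP.
# Assumption: a base learning rate tuned by grid search at a small proxy
# width transfers to larger widths once per-layer rates are rescaled by
# base_width / width for hidden (matrix-like) parameters.

def mup_hidden_lr(base_lr: float, base_width: int, width: int) -> float:
    """Rescale a base learning rate tuned at `base_width` for a model of `width`."""
    return base_lr * base_width / width

# Tune once at a cheap proxy width, then reuse the result at scale.
base_lr = 0.02      # illustrative value "found" at width 256
base_width = 256

for width in (256, 1024, 4096):
    print(f"width={width}: hidden lr={mup_hidden_lr(base_lr, base_width, width)}")
```

The point of the sketch is the division of labor: the expensive grid search runs only at `base_width`, and the parameterization (not further tuning) adapts the hyperparameter to each larger width.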