Function-Space Learning Rates

📅 2025-02-24
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This paper addresses the limitation that parameter-space learning rates fail to reflect how much a parameter update actually changes the network's output function. To this end, the authors propose **function-space learning rates**, which directly quantify the magnitude of the change in the output function induced by an update to each parameter tensor. They develop efficient estimators based on a few additional backward passes, enabling low-overhead measurement at the start of, or periodically during, training. Building on this, they introduce **FLeRM (Function-space Learning Rate Matching)**, which records function-space learning rates while training a small, cheap base model and then rescales the layerwise parameter-space learning rates of larger models to match them, achieving hyperparameter transfer across model width, depth, initialization scale, and LoRA rank without manual tuning. Experiments on residual MLPs and transformers show that FLeRM substantially improves hyperparameter transferability and training stability for large models, at a cost equivalent to only a few extra backward passes.

📝 Abstract
We consider layerwise function-space learning rates, which measure the magnitude of the change in a neural network's output function in response to an update to a parameter tensor. This contrasts with traditional learning rates, which describe the magnitude of changes in parameter space. We develop efficient methods to measure and set function-space learning rates in arbitrary neural networks, requiring only minimal computational overhead through a few additional backward passes that can be performed at the start of, or periodically during, training. We demonstrate two key applications: (1) analysing the dynamics of standard neural network optimisers in function space, rather than parameter space, and (2) introducing FLeRM (Function-space Learning Rate Matching), a novel approach to hyperparameter transfer across model scales. FLeRM records function-space learning rates while training a small, cheap base model, then automatically adjusts parameter-space layerwise learning rates when training larger models to maintain consistent function-space updates. FLeRM gives hyperparameter transfer across model width, depth, initialisation scale, and LoRA rank in various architectures including MLPs with residual connections and transformers with different layer normalisation schemes.
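To make the core quantity concrete, here is a minimal sketch of a layerwise function-space learning rate: the RMS change in a network's outputs when a single parameter tensor's update is applied in isolation. This uses a finite-difference probe on a toy numpy MLP rather than the paper's backward-pass-based estimator; the function names and network are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 2-layer ReLU MLP; W1: (hidden, d_in), W2: (d_out, hidden).
def forward(params, X):
    W1, W2 = params
    return np.maximum(X @ W1.T, 0.0) @ W2.T

def fs_lrs(params, updates, X):
    """Per-layer function-space learning rate: RMS change in the network's
    outputs on a batch X when each layer's update is applied in isolation
    (finite-difference proxy, not the paper's backward-pass estimator)."""
    base = forward(params, X)
    lrs = []
    for i, delta in enumerate(updates):
        perturbed = list(params)
        perturbed[i] = params[i] + delta
        diff = forward(perturbed, X) - base
        lrs.append(np.linalg.norm(diff) / np.sqrt(X.shape[0]))
    return lrs

d_in, hidden, d_out, batch = 8, 16, 4, 32
params = [rng.normal(size=(hidden, d_in)) / np.sqrt(d_in),
          rng.normal(size=(d_out, hidden)) / np.sqrt(hidden)]
updates = [1e-3 * rng.normal(size=p.shape) for p in params]  # stands in for -lr * grad
X = rng.normal(size=(batch, d_in))

print(fs_lrs(params, updates, X))  # two positive per-layer magnitudes
```

Note that the same parameter-space step size can yield very different per-layer values here, which is exactly the mismatch the paper targets.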
Problem

Research questions and friction points this paper is trying to address.

Measure function-space learning rates efficiently
Analyze neural network optimizers in function space
Enable hyperparameter transfer across model scales
Innovation

Methods, ideas, or system contributions that make the work stand out.

Function-space learning rates
Efficient measurement methods
FLeRM hyperparameter transfer
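The matching step behind FLeRM can be sketched as a simple per-layer rescaling: given function-space learning rates recorded on the base model and those measured on the scaled model, adjust each layer's parameter-space learning rate by their ratio. This assumes the function-space change is roughly proportional to the parameter-space learning rate (a first-order approximation); the function name and signature are hypothetical, not the paper's API.

```python
def flerm_rescale(param_lrs, base_fs_lrs, scaled_fs_lrs, eps=1e-12):
    """Rescale each layer's parameter-space learning rate so the scaled
    model's function-space update matches the base model's recorded one.
    Assumes function-space change scales linearly with the learning rate."""
    return [lr * base / max(measured, eps)
            for lr, base, measured in zip(param_lrs, base_fs_lrs, scaled_fs_lrs)]

# A scaled model whose layer 0 moves too much in function space and
# whose layer 1 moves too little, relative to the recorded base values:
new_lrs = flerm_rescale([0.1, 0.1], base_fs_lrs=[1.0, 2.0], scaled_fs_lrs=[2.0, 1.0])
print(new_lrs)  # [0.05, 0.2]
```

In practice the measurement and rescaling would be repeated periodically during training, since the per-layer ratios drift as the loss landscape changes.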