🤖 AI Summary
This work asks what proportion of a neural network's parameters is truly necessary to encode task-specific information, and proposes a method that trains only extremely low-rank LoRA adapters while keeping the backbone network entirely frozen at its random initialization. Validated across diverse architectures and tasks, the approach reveals that task-relevant information resides in an exceptionally low-dimensional subspace: the method recovers 96%–100% of full fine-tuning performance using merely 0.5%–40% trainable parameters across nine benchmarks, substantially reducing storage and memory overhead. The rank at which LoRA performance saturates is linked to the intrinsic dimensionality of the task. Because any random initialization works, backbones are interchangeable and need only be distributed as random seeds.
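Because the frozen backbone is fully determined by a random seed, "distributing the model" reduces to shipping the seed plus the small adapter weights. The following is a minimal numpy sketch of that idea; the function name, seed value, and shapes are illustrative, not the paper's actual API.

```python
import numpy as np

SEED = 42          # illustrative seed; this is all the "backbone" that ships
SHAPE = (256, 256)  # illustrative layer shape

def make_backbone(seed, shape):
    """Deterministically regenerate the frozen random backbone from a seed."""
    return np.random.default_rng(seed).standard_normal(shape)

# Sender side: train adapters against the seeded backbone, then ship
# only (seed, adapter weights) rather than the full weight matrix.
W_sender = make_backbone(SEED, SHAPE)

# Receiver side: the identical backbone is rebuilt from the seed alone.
W_receiver = make_backbone(SEED, SHAPE)

assert np.array_equal(W_sender, W_receiver)  # bit-identical reconstruction
```

The backbone itself therefore contributes a constant-size footprint (one integer), so the shipped artifact grows only with the adapters.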
📝 Abstract
How many of a neural network's parameters actually encode task-specific information? We investigate this question with LottaLoRA, a training paradigm in which every backbone weight is drawn at random and frozen; only low-rank LoRA adapters are trained. Across nine benchmarks spanning diverse architecture families, from single-layer classifiers to 900M-parameter Transformers, low-rank adapters over frozen random backbones recover 96–100% of fully trained performance while training only 0.5–40% of the parameters. The task-specific signal therefore occupies a subspace orders of magnitude smaller than the full parameter count suggests. Three mechanistic findings underpin this result: (1) the frozen backbone is actively exploited: when it is static, the learned scaling~$\beta$ remains strictly positive across all architectures, but when the scaffold is destabilized, the optimizer silences it and the LoRA factors absorb all task information; (2) the frozen backbone is interchangeable: any random initialization works equally well, provided it remains fixed throughout training; and (3) the minimum LoRA rank at which performance saturates estimates the intrinsic dimensionality of the task, reminiscent of the number of components retained in Principal Component Analysis (PCA). The construction is formally analogous to Reservoir Computing unfolded along the depth axis of a feedforward network. Because the backbone is determined by a random seed alone, models can be distributed as adapters plus a seed, a footprint that grows with task complexity rather than model size, so storage and memory savings compound as architectures scale.
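The construction described above can be sketched in a few lines of numpy. This assumes a standard LoRA-style parameterization in which the effective weight is $\beta W + BA$, with $W$ frozen and seeded and only $A$, $B$, and $\beta$ trainable; the initialization scheme and function names here are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def lotta_lora_layer(seed, d_in, d_out, rank):
    """One layer of the sketched paradigm (hypothetical helper).

    W is drawn from a fixed seed and never updated; the trainable state
    is just the low-rank factors A, B and the scalar scaling beta.
    """
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((d_out, d_in)) / np.sqrt(d_in)  # frozen backbone
    A = np.zeros((rank, d_in))                    # trainable low-rank factor
    B = rng.standard_normal((d_out, rank)) * 0.01  # trainable low-rank factor
    beta = 1.0                                    # trainable backbone scaling
    return W, A, B, beta

def forward(x, W, A, B, beta):
    # Effective weight: beta * W (frozen, random) + B @ A (trained, rank-limited).
    return (beta * W + B @ A) @ x

W, A, B, beta = lotta_lora_layer(seed=0, d_in=64, d_out=64, rank=4)
y = forward(np.ones(64), W, A, B, beta)

trainable = A.size + B.size + 1   # A, B, and beta
total = W.size + A.size + B.size + 1
print(f"trainable fraction: {trainable / total:.3f}")
```

The trainable fraction scales as $r(d_{\text{in}} + d_{\text{out}}) / (d_{\text{in}} d_{\text{out}})$, which is why sweeping the rank $r$ until performance saturates can serve as a probe of the task's intrinsic dimensionality.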