🤖 AI Summary
To address the challenge of fine-tuning large language models under memory constraints, this paper proposes a dynamic sparse fine-tuning method. Building on the observation that layer importance is architecture-dependent and can be estimated a priori, the method introduces a dynamic random channel selection mechanism: channels are stochastically resampled across epochs within pre-identified critical layers, while activations and weight gradients are jointly sparsified. Crucially, it requires no additional parameters or architectural modifications, substantially reducing memory and computational overhead. Experiments across diverse downstream tasks and mainstream architectures demonstrate state-of-the-art performance. The approach achieves up to 99% activation sparsity, 95% weight-gradient sparsity, and a 97% reduction in FLOPs for weight-gradient computation, delivering strong efficiency without compromising accuracy.
📝 Abstract
Memory-efficient training of deep neural networks has become increasingly important as models grow larger while deployment environments impose strict resource constraints. We propose TraDy, a novel transfer learning scheme leveraging two key insights: layer importance for updates is architecture-dependent and determinable a priori, while dynamic stochastic channel selection provides superior gradient approximation compared to static approaches. We introduce a dynamic channel selection approach that stochastically resamples channels between epochs within preselected layers. Extensive experiments demonstrate TraDy achieves state-of-the-art performance across various downstream tasks and architectures while maintaining strict memory constraints, achieving up to 99% activation sparsity, 95% weight derivative sparsity, and 97% reduction in FLOPs for weight derivative computation.
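The core mechanism described above, stochastically resampling a sparse subset of channels between epochs within preselected layers, can be illustrated with a minimal NumPy sketch. This is not the paper's implementation; all names (`resample_channels`, `sparsify`, `keep_ratio`) and the toy training loop are illustrative assumptions made here.

```python
import numpy as np

def resample_channels(num_channels, keep_ratio, rng):
    # Stochastically draw a fresh subset of channel indices to keep this epoch
    # (illustrative stand-in for the paper's dynamic channel selection).
    k = max(1, int(round(num_channels * keep_ratio)))
    return rng.choice(num_channels, size=k, replace=False)

def sparsify(tensor, kept, axis=0):
    # Zero out every channel along `axis` except the kept ones; the same mask
    # would be applied to activations and their weight gradients.
    mask = np.zeros(tensor.shape[axis], dtype=bool)
    mask[kept] = True
    shape = [1] * tensor.ndim
    shape[axis] = tensor.shape[axis]
    return tensor * mask.reshape(shape)

rng = np.random.default_rng(0)
activations = np.ones((8, 4))  # hypothetical layer: (channels, features)

# Toy loop: channels are resampled between epochs, so the sparse support
# changes dynamically rather than staying fixed as in static selection.
for epoch in range(3):
    kept = resample_channels(8, keep_ratio=0.25, rng=rng)
    sparse_act = sparsify(activations, kept, axis=0)
    # only ~25% of channels remain nonzero this epoch
```

Because the kept set is redrawn each epoch, different channels contribute gradient signal over the course of training, which is the intuition behind dynamic selection approximating the full gradient better than a fixed static mask.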