🤖 AI Summary
To address the challenge of fine-tuning large language models under memory constraints, this paper proposes a dynamic sparse fine-tuning method. Building on the observation that layer importance is architecture-dependent and can be estimated a priori, the method introduces a dynamic random channel selection mechanism: channels are stochastically resampled across epochs within pre-identified critical layers, while activations and weight gradients are jointly sparsified. Crucially, it requires no additional parameters or architectural modifications, substantially reducing memory and computational overhead. Experiments across diverse downstream tasks and mainstream architectures demonstrate state-of-the-art performance. The approach achieves up to 99% activation sparsity, 95% weight-gradient sparsity, and a 97% reduction in FLOPs for weight-gradient computation, delivering strong efficiency without compromising accuracy.
📝 Abstract
Memory-efficient training of deep neural networks has become increasingly important as models grow larger while deployment environments impose strict resource constraints. We propose TraDy, a novel transfer learning scheme leveraging two key insights: layer importance for updates is architecture-dependent and determinable a priori, while dynamic stochastic channel selection provides superior gradient approximation compared to static approaches. We introduce a dynamic channel selection approach that stochastically resamples channels between epochs within preselected layers. Extensive experiments demonstrate TraDy achieves state-of-the-art performance across various downstream tasks and architectures while maintaining strict memory constraints, achieving up to 99% activation sparsity, 95% weight derivative sparsity, and 97% reduction in FLOPs for weight derivative computation.
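The core mechanism described above, stochastically resampling a sparse subset of channels between epochs within preselected layers, can be illustrated with a minimal NumPy sketch. This is not the paper's implementation; all names (`resample_channels`, `sparsify`, `keep_ratio`) and the toy training loop are illustrative assumptions made here.

```python
import numpy as np

def resample_channels(num_channels, keep_ratio, rng):
    # Stochastically draw a fresh subset of channel indices to keep this epoch
    # (illustrative stand-in for the paper's dynamic channel selection).
    k = max(1, int(round(num_channels * keep_ratio)))
    return rng.choice(num_channels, size=k, replace=False)

def sparsify(tensor, kept, axis=0):
    # Zero out every channel along `axis` except the kept ones; the same mask
    # would be applied to activations and their weight gradients.
    mask = np.zeros(tensor.shape[axis], dtype=bool)
    mask[kept] = True
    shape = [1] * tensor.ndim
    shape[axis] = tensor.shape[axis]
    return tensor * mask.reshape(shape)

rng = np.random.default_rng(0)
activations = np.ones((8, 4))  # hypothetical layer: (channels, features)

# Toy loop: channels are resampled between epochs, so the sparse support
# changes dynamically rather than staying fixed as in static selection.
for epoch in range(3):
    kept = resample_channels(8, keep_ratio=0.25, rng=rng)
    sparse_act = sparsify(activations, kept, axis=0)
    # only ~25% of channels remain nonzero this epoch
```

Because the kept set is redrawn each epoch, different channels contribute gradient signal over the course of training, which is the intuition behind dynamic selection approximating the full gradient better than a fixed static mask.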