Study of Training Dynamics for Memory-Constrained Fine-Tuning

📅 2025-10-22
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the challenge of fine-tuning large language models under memory constraints, this paper proposes a dynamic sparse fine-tuning method. Observing that layer importance is architecture-dependent and estimable a priori, the method introduces a dynamic random channel selection mechanism: channels are stochastically resampled across epochs within pre-identified critical layers, while activations and weight gradients are jointly sparsified. Crucially, it requires no additional parameters or architectural modifications, substantially reducing memory and computational overhead. Experiments across diverse downstream tasks and mainstream architectures demonstrate state-of-the-art performance. The approach achieves up to 99% activation sparsity, 95% weight gradient sparsity, and a 97% reduction in gradient-computation FLOPs—delivering exceptional efficiency without compromising accuracy.

📝 Abstract
Memory-efficient training of deep neural networks has become increasingly important as models grow larger while deployment environments impose strict resource constraints. We propose TraDy, a novel transfer learning scheme leveraging two key insights: layer importance for updates is architecture-dependent and determinable a priori, while dynamic stochastic channel selection provides superior gradient approximation compared to static approaches. We introduce a dynamic channel selection approach that stochastically resamples channels between epochs within preselected layers. Extensive experiments demonstrate TraDy achieves state-of-the-art performance across various downstream tasks and architectures while maintaining strict memory constraints, achieving up to 99% activation sparsity, 95% weight derivative sparsity, and 97% reduction in FLOPs for weight derivative computation.
Problem

Research questions and friction points this paper is trying to address.

Memory-efficient fine-tuning of large neural networks under strict resource constraints
Dynamic channel selection for superior gradient approximation in transfer learning
Achieving high sparsity and computational efficiency while maintaining model performance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dynamic stochastic channel selection for gradient approximation
Layer importance determined a priori for architecture-dependent updates
Achieves high sparsity in activations and weight derivatives
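The channel-selection idea above can be sketched as follows. This is a minimal illustration under assumptions, not the paper's implementation: the helper names (`resample_channels`, `sparse_weight_grad`) and the 5%-channel budget are illustrative, and the sketch covers a single linear layer.

```python
import numpy as np

def resample_channels(num_channels, keep, rng):
    """Stochastically pick `keep` of `num_channels` input channels (resampled each epoch)."""
    return rng.choice(num_channels, size=keep, replace=False)

def sparse_weight_grad(activations, output_grad, selected):
    """Weight derivative for a linear layer, computed only over the selected channels.

    The dense gradient would be output_grad.T @ activations (shape: out x in);
    rows of `activations` are batch samples, columns are input channels.
    Columns outside `selected` stay exactly zero, so only the selected
    activations need to be stored and multiplied.
    """
    grad = np.zeros((output_grad.shape[1], activations.shape[1]))
    grad[:, selected] = output_grad.T @ activations[:, selected]
    return grad

rng = np.random.default_rng(0)
batch, c_in, c_out = 8, 100, 4
a = rng.standard_normal((batch, c_in))   # layer input activations
g = rng.standard_normal((batch, c_out))  # gradient w.r.t. layer output

selected = resample_channels(c_in, keep=5, rng=rng)  # 95% weight-gradient sparsity
grad = sparse_weight_grad(a, g, selected)
print(f"nonzero columns: {np.count_nonzero(grad.any(axis=0))} / {c_in}")
```

Because only 5 of 100 gradient columns are computed, the matmul FLOPs for the weight derivative drop in proportion to the selection ratio, which is the mechanism behind the FLOP reductions reported above.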
Aël Quélennec
LTCI, Télécom Paris, Institut Polytechnique de Paris
Nour Hezbri
LTCI, Télécom Paris, Institut Polytechnique de Paris
Pavlo Mozharovskyi
LTCI, Télécom Paris, Institut Polytechnique de Paris
machine learning · computational statistics · data depth · interpretability of AI · functional data analysis
Van-Tam Nguyen
LTCI, Télécom Paris, Institut Polytechnique de Paris
Enzo Tartaglione
Associate Professor, Télécom Paris, Institut Polytechnique de Paris
deep learning · compression · pruning · debiasing · frugal AI