🤖 AI Summary
The conventional “fine-tune-then-compress” paradigm for lightweighting large language models (LLMs) during post-training incurs significant performance degradation and introduces redundant intermediate models. Method: This paper proposes the first end-to-end framework that jointly optimizes fine-tuning and structured compression—integrating progressive knowledge distillation, dynamic structured pruning, and low-rank parameter constraints directly into the downstream fine-tuning process to cooperatively shrink the parameter space. Contribution/Results: By eliminating the need to store and compute full-sized intermediate models, our approach reduces memory and computational overhead. On multiple benchmark tasks, it achieves an average accuracy gain of 2.1% at equivalent parameter counts and compresses model size by up to 4.3×, substantially mitigating performance decay inherent in conventional lightweighting pipelines.
📝 Abstract
To reduce model size during post-training, compression methods, including knowledge distillation, low-rank approximation, and pruning, are often applied after fine-tuning the model. However, sequential fine-tuning and compression sacrifices performance, while creating a larger than necessary model as an intermediate step. In this work, we aim to reduce this gap, by directly constructing a smaller model while guided by the downstream task. We propose to jointly fine-tune and compress the model by gradually distilling it to a pruned low-rank structure. Experiments demonstrate that joint fine-tuning and compression significantly outperforms other sequential compression methods.