🤖 AI Summary
To address the internal inconsistency and task-performance degradation caused by layer-wise pruning of large language models (LLMs), this paper proposes Taco-SVD, a task-aware singular value decomposition framework. Methodologically, Taco-SVD couples gradient-based attribution with singular vector selection: it first identifies task-critical linear transformation directions via lightweight attribution mapping, then selects gradient-weighted singular vectors to preserve task-sensitive components, and finally enforces inter-layer consistency constraints to ensure structural stability after compression. The framework is architecture-agnostic and requires no fine-tuning for deployment. Experiments across diverse LLMs show that Taco-SVD consistently reduces perplexity and improves downstream task accuracy by 5.2% on average, while adding less than 0.3% computational overhead. Its key contributions are the first integration of gradient attribution into SVD direction selection, the introduction of layer-wise consistency regularization for compressed LLMs, and a plug-and-play compression method that achieves high task fidelity at minimal computational cost.
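The core idea described above, i.e. scoring each singular direction of a weight matrix by a gradient-based saliency and keeping only the task-critical ones, can be sketched as follows. This is a simplified illustration, not the paper's actual implementation: the function name `taco_svd_compress` and the first-order saliency `|u_i^T G v_i| * s_i` (how strongly the task-loss gradient `G` aligns with each singular direction) are assumptions made for the example.

```python
import numpy as np

def taco_svd_compress(W, G, k):
    """Sketch of gradient-weighted singular vector selection.

    W : weight matrix of a linear layer
    G : gradient of the task loss w.r.t. W (dL/dW)
    k : number of singular directions to retain

    Hypothetical saliency: |u_i^T G v_i| * s_i, a first-order estimate
    of how much the task loss cares about singular direction i.
    """
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    # Project the gradient onto each singular direction (u_i, v_i)
    # and weight by the singular value s_i.
    saliency = np.abs(np.einsum('mi,mn,ni->i', U, G, Vt.T)) * S
    # Keep the k most task-sensitive directions, not simply the largest s_i.
    keep = np.argsort(saliency)[::-1][:k]
    # Low-rank reconstruction from the retained directions.
    return (U[:, keep] * S[keep]) @ Vt[keep, :]

rng = np.random.default_rng(0)
W = rng.standard_normal((8, 6))   # toy weight matrix
G = rng.standard_normal((8, 6))   # toy task gradient
W_c = taco_svd_compress(W, G, k=3)
```

Note the contrast with plain truncated SVD: there the top-k singular values are kept regardless of the task, whereas here the gradient term can promote a direction with a smaller singular value if the downstream loss is sensitive to it.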
📝 Abstract
Layer removal has emerged as a promising approach for compressing large language models (LLMs): it exploits redundancy across layers to reduce model size and accelerate inference. However, it often compromises internal consistency, causing performance degradation and instability, with impacts that vary across model architectures. In this work, we propose Taco-SVD, a task-aware framework that retains task-critical singular value directions, preserving internal consistency while enabling efficient compression. Unlike direct layer removal, Taco-SVD preserves task-critical transformations to mitigate performance degradation. By leveraging gradient-based attribution, Taco-SVD aligns the retained singular values with downstream task objectives. Extensive evaluations demonstrate that Taco-SVD outperforms existing methods in both perplexity and task performance across architectures, with minimal computational overhead.