🤖 AI Summary
To address the difficulty of deploying large language models (LLMs) on resource-constrained devices and the high cost of cloud-based inference, this paper proposes a task-oriented, fine-tuning-free low-rank compression method. The core innovation is to introduce basis-space sparsity selection into LLM compression for the first time. The method uses low-rank decomposition to model the weight space, identifies redundant basis vectors that are irrelevant to a given downstream task (e.g., mathematical reasoning or code generation), and dynamically prunes them while enhancing task-relevant bases, yielding a task-adaptive low-rank representation. Crucially, the weights are reconstructed without any fine-tuning. Evaluated on Llama 2-7B and -13B, the method significantly reduces model size while keeping accuracy comparable to state-of-the-art low-rank compression methods on mathematical and coding benchmarks, balancing generalization capability against task specificity.
📝 Abstract
Large language models (LLMs) significantly enhance the performance of various applications, but they are computationally intensive and energy-demanding. This makes them challenging to deploy on devices with limited resources, such as personal computers and mobile/wearable devices, and results in substantial inference costs even in resource-rich environments like cloud servers. To extend the use of LLMs, we introduce a low-rank decomposition approach that compresses these models effectively and is tailored to the requirements of specific applications. We observe that LLMs pretrained on general datasets contain many redundant components that particular applications do not need. Our method identifies and removes these redundant parts, retaining only the elements necessary for the target applications. Specifically, we represent the weight matrices of LLMs as a linear combination of basis components. We then prune the irrelevant bases and enhance the model with new bases beneficial for specific applications. Deep compression experiments on the Llama 2-7B and -13B models, targeting mathematical reasoning and code generation, show that our method significantly reduces model size while maintaining accuracy comparable to state-of-the-art low-rank compression techniques.
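The idea of writing a weight matrix as a linear combination of bases and pruning the task-irrelevant ones can be illustrated with a minimal NumPy sketch. This is not the paper's algorithm: it assumes the bases are the rank-one terms of a singular value decomposition, and it uses a hypothetical relevance score (the output energy each basis contributes on a batch of task inputs `X`). The function names `low_rank_bases`, `task_scores`, and `prune_bases` are made up for illustration, and the step that adds new task-beneficial bases is not covered.

```python
import numpy as np

def low_rank_bases(W):
    """Decompose W into rank-one bases: W = sum_i s[i] * outer(U[:, i], Vt[i])."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    return U, s, Vt

def task_scores(s, Vt, X):
    """Hypothetical relevance score (an assumption, not from the paper):
    mean output energy contributed by each basis on task inputs X,
    where X has shape (n_samples, in_features)."""
    proj = Vt @ X.T                      # (rank, n_samples) input projections
    return (s ** 2) * np.mean(proj ** 2, axis=1)

def prune_bases(W, X, keep):
    """Keep the `keep` highest-scoring bases and return factors A, B with W ~ A @ B."""
    U, s, Vt = low_rank_bases(W)
    scores = task_scores(s, Vt, X)
    idx = np.sort(np.argsort(scores)[::-1][:keep])
    A = U[:, idx] * s[idx]               # (out_features, keep)
    B = Vt[idx]                          # (keep, in_features)
    return A, B

# Toy example: task inputs concentrate on the first two input dimensions.
rng = np.random.default_rng(0)
W = rng.standard_normal((8, 6))
X = rng.standard_normal((32, 6)) * np.array([1.0, 1.0, 0.1, 0.1, 0.1, 0.1])
A, B = prune_bases(W, X, keep=3)
print(A.shape, B.shape)  # (8, 3) (3, 6)
```

Because the left singular vectors are orthonormal, the mean squared output error on `X` equals the sum of the scores of the dropped bases, so under this particular score the selection is never worse on `X` than plain truncation to the largest singular values. Storage savings arise only when `keep` is small relative to the matrix dimensions, since the factors hold `keep * (out_features + in_features)` values.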