🤖 AI Summary
Deploying large language models (LLMs) on resource-constrained devices remains challenging due to high memory and computational demands; existing low-rank compression methods, which rely on uniform rank reduction, suffer from substantial performance degradation and inefficient decoding. This paper proposes a fine-grained low-rank compression framework to address these issues. The method employs low-rank decomposition while jointly optimizing parameter efficiency and task sensitivity. Key contributions include: (1) a layer-wise adaptive rank allocation strategy that dynamically assigns ranks per layer based on both parameter importance and task-specific requirements; and (2) the first integration of a progressive low-rank mechanism into the decoding process, ensuring generation consistency and output quality. The framework preserves inference stability without compromising the compression ratio. Extensive evaluations across multiple benchmarks demonstrate significant improvements over state-of-the-art approaches: up to a 17% ROUGE-L gain on summarization tasks, while maintaining high decoding efficiency and effectively mitigating performance loss.
📝 Abstract
Although large language models (LLMs) have achieved remarkable performance, their enormous parameter counts hinder deployment on resource-constrained hardware. Low-rank compression can reduce both memory usage and computational demand, but applying a uniform compression ratio across all layers often leads to significant performance degradation, and previous methods perform poorly during decoding. To address these issues, we propose the Fine-grained Low-Rank Compressor (FLRC), which efficiently determines an optimal rank allocation for each layer and incorporates progressive low-rank decoding to maintain text generation quality. Comprehensive experiments on diverse benchmarks demonstrate the superiority of FLRC, which achieves up to a 17% improvement in ROUGE-L on summarization tasks compared to state-of-the-art low-rank compression methods, establishing a more robust and efficient framework for LLM inference.
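To make the underlying idea concrete, the sketch below shows generic low-rank compression of a single weight matrix via truncated SVD, the standard factorization these methods build on. This is only an illustration of the general technique: the matrix sizes and the fixed rank `r` are made-up values, whereas FLRC's contribution is precisely to choose a different rank per layer and to apply the factorization progressively during decoding.

```python
import numpy as np

# Illustrative sizes and rank (not taken from the paper); in FLRC the
# rank would be allocated per layer rather than fixed globally.
rng = np.random.default_rng(0)
d_out, d_in, r = 256, 256, 32

W = rng.standard_normal((d_out, d_in))  # stand-in for a layer's weight matrix

# Factor W ≈ A @ B with A: (d_out, r) and B: (r, d_in) via truncated SVD.
U, S, Vt = np.linalg.svd(W, full_matrices=False)
A = U[:, :r] * S[:r]   # absorb the top-r singular values into the left factor
B = Vt[:r, :]

# Parameter count drops from d_out*d_in to r*(d_out + d_in).
orig_params = d_out * d_in
lowrank_params = r * (d_out + d_in)
print(f"compression ratio: {orig_params / lowrank_params:.2f}x")

# A forward pass x @ W.T becomes (x @ B.T) @ A.T using only the small factors,
# which is also where the memory and compute savings come from at inference.
x = rng.standard_normal((4, d_in))
y_full = x @ W.T
y_lowrank = (x @ B.T) @ A.T
```

With these illustrative sizes the factorization stores 4x fewer parameters; how much of the output `y_full` the low-rank product recovers depends entirely on how fast the singular values of `W` decay, which is why a uniform rank across layers (the failure mode the paper targets) can hurt badly on layers with slow spectral decay.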