FLRC: Fine-grained Low-Rank Compressor for Efficient LLM Inference

📅 2025-10-10
📈 Citations: 0
Influential: 0
🤖 AI Summary
Deploying large language models (LLMs) on resource-constrained devices remains challenging due to high memory and computational demands; existing low-rank compression methods—relying on uniform rank reduction—suffer from substantial performance degradation and inefficient decoding. This paper proposes a fine-grained low-rank compression framework to address these issues. Our method employs low-rank decomposition while jointly optimizing parameter efficiency and task sensitivity. Key contributions include: (1) a layer-wise adaptive rank allocation strategy that dynamically assigns ranks per layer based on both parameter importance and task-specific requirements; and (2) the first integration of a progressive low-rank mechanism into the decoding process, ensuring generation consistency and output quality. The framework preserves inference stability without compromising compression ratio. Extensive evaluations across multiple benchmarks demonstrate significant improvements over state-of-the-art approaches: up to +17% ROUGE-L gain on summarization tasks, while maintaining high decoding efficiency and effectively mitigating performance loss.
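The low-rank decomposition the summary refers to is, at its core, the replacement of a dense weight matrix by a product of two thin factors. A minimal sketch of that building block, using truncated SVD (a standard choice; the paper's own factorization procedure may differ):

```python
import numpy as np

def low_rank_factorize(W, rank):
    """Factor W (out_dim x in_dim) into A @ B with A (out_dim x rank)
    and B (rank x in_dim) via truncated SVD, the generic building block
    of low-rank weight compression."""
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    A = U[:, :rank] * S[:rank]   # absorb singular values into the left factor
    B = Vt[:rank, :]
    return A, B

rng = np.random.default_rng(0)
# A weight matrix that is approximately rank-8: low-rank signal plus small noise
W = rng.standard_normal((256, 8)) @ rng.standard_normal((8, 512))
W += 0.01 * rng.standard_normal((256, 512))

A, B = low_rank_factorize(W, rank=8)
err = np.linalg.norm(W - A @ B) / np.linalg.norm(W)
params_before = W.size              # 256 * 512
params_after = A.size + B.size      # 256 * 8 + 8 * 512
print(f"relative error: {err:.4f}")
print(f"param ratio: {params_after / params_before:.3f}")
```

Storing `A` and `B` instead of `W` shrinks both memory and matmul cost whenever `rank * (out_dim + in_dim) < out_dim * in_dim`, which is why the choice of per-layer rank is the central knob the paper optimizes.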

📝 Abstract
Although large language models (LLMs) have achieved remarkable performance, their enormous parameter counts hinder deployment on resource-constrained hardware. Low-rank compression can reduce both memory usage and computational demand, but applying a uniform compression ratio across all layers often leads to significant performance degradation, and previous methods perform poorly during decoding. To address these issues, we propose the Fine-grained Low-Rank Compressor (FLRC), which efficiently determines an optimal rank allocation for each layer and incorporates progressive low-rank decoding to maintain text generation quality. Comprehensive experiments on diverse benchmarks demonstrate the superiority of FLRC, achieving up to a 17% improvement in ROUGE-L on summarization tasks compared to state-of-the-art low-rank compression methods, establishing a more robust and efficient framework for LLM inference.
Problem

Research questions and friction points this paper is trying to address.

Determining optimal layer-specific compression ratios
Maintaining text quality during low-rank decoding
Reducing memory and computation for LLM deployment
Innovation

Methods, ideas, or system contributions that make the work stand out.

Fine-grained rank allocation per layer
Progressive low-rank decoding for quality
Efficient compression for LLM inference
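The first innovation above, fine-grained per-layer rank allocation, can be illustrated with a simple spectrum-based heuristic: give each layer the smallest rank that retains a fixed fraction of its singular-value energy, so layers with fast-decaying spectra are compressed harder. This is only an illustrative stand-in; FLRC's actual allocation also accounts for task sensitivity, per the summary.

```python
import numpy as np

def allocate_ranks(weights, energy=0.95):
    """For each weight matrix, pick the smallest rank whose singular values
    retain `energy` of the total squared-singular-value mass.
    (Illustrative heuristic only, not FLRC's actual allocation rule.)"""
    ranks = []
    for W in weights:
        s = np.linalg.svd(W, compute_uv=False)
        cum = np.cumsum(s**2) / np.sum(s**2)
        # first index where cumulative energy reaches the threshold, 1-based
        ranks.append(int(np.searchsorted(cum, energy) + 1))
    return ranks

rng = np.random.default_rng(1)
fast = rng.standard_normal((64, 4)) @ rng.standard_normal((4, 64))  # sharply low-rank layer
slow = rng.standard_normal((64, 64))                                # near full-rank layer
ranks = allocate_ranks([fast, slow])
print(ranks)  # the low-rank layer should be assigned a much smaller rank
```

Under a uniform-ratio scheme both layers would get the same rank; the point of per-layer allocation is that `slow` keeps most of its rank while `fast` is compressed aggressively, which is the behavior the bullet describes.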
👥 Authors
Yu-Chen Lu, National Yang Ming Chiao Tung University
Chong-Yan Chen, National Yang Ming Chiao Tung University
Chi-Chih Chang, Cornell University (Efficient Deep Learning)
Yu-Fang Hu, National Yang Ming Chiao Tung University
Kai-Chiang Wu, Department of Computer Science, National Yang Ming Chiao Tung University (NYCU, former NCTU); research areas: EDA, DfT/DfR, AI for IC Design & Design Automation, Edge AI, Efficient Deep Learning