🤖 AI Summary
Deploying large language models (LLMs) on resource-constrained devices remains challenging due to high memory and computational demands; existing low-rank compression methods, which rely on uniform rank reduction, suffer from substantial performance degradation and inefficient decoding. This paper proposes a fine-grained low-rank compression framework to address these issues. The method employs low-rank decomposition while jointly optimizing parameter efficiency and task sensitivity. Key contributions include: (1) a layer-wise adaptive rank allocation strategy that dynamically assigns ranks per layer based on both parameter importance and task-specific requirements; and (2) the first integration of a progressive low-rank mechanism into the decoding process, ensuring generation consistency and output quality. The framework preserves inference stability without compromising the compression ratio. Extensive evaluations across multiple benchmarks demonstrate significant improvements over state-of-the-art approaches: up to a 17% ROUGE-L gain on summarization tasks, while maintaining high decoding efficiency and effectively mitigating performance loss.
📝 Abstract
Although large language models (LLMs) have achieved remarkable performance, their enormous parameter counts hinder deployment on resource-constrained hardware. Low-rank compression can reduce both memory usage and computational demand, but applying a uniform compression ratio across all layers often leads to significant performance degradation, and previous methods perform poorly during decoding. To address these issues, we propose the Fine-grained Low-Rank Compressor (FLRC), which efficiently determines an optimal rank allocation for each layer and incorporates progressive low-rank decoding to maintain text generation quality. Comprehensive experiments on diverse benchmarks demonstrate the superiority of FLRC, which achieves up to a 17% improvement in ROUGE-L on summarization tasks compared to state-of-the-art low-rank compression methods, establishing a more robust and efficient framework for LLM inference.
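To make the underlying idea concrete, the sketch below shows generic low-rank compression of a single weight matrix via truncated SVD, the standard factorization these methods build on. This is only an illustration of the general technique: the matrix sizes and the fixed rank `r` are made-up values, whereas FLRC's contribution is precisely to choose a different rank per layer and to apply the factorization progressively during decoding.

```python
import numpy as np

# Illustrative sizes and rank (not taken from the paper); in FLRC the
# rank would be allocated per layer rather than fixed globally.
rng = np.random.default_rng(0)
d_out, d_in, r = 256, 256, 32

W = rng.standard_normal((d_out, d_in))  # stand-in for a layer's weight matrix

# Factor W ≈ A @ B with A: (d_out, r) and B: (r, d_in) via truncated SVD.
U, S, Vt = np.linalg.svd(W, full_matrices=False)
A = U[:, :r] * S[:r]   # absorb the top-r singular values into the left factor
B = Vt[:r, :]

# Parameter count drops from d_out*d_in to r*(d_out + d_in).
orig_params = d_out * d_in
lowrank_params = r * (d_out + d_in)
print(f"compression ratio: {orig_params / lowrank_params:.2f}x")

# A forward pass x @ W.T becomes (x @ B.T) @ A.T using only the small factors,
# which is also where the memory and compute savings come from at inference.
x = rng.standard_normal((4, d_in))
y_full = x @ W.T
y_lowrank = (x @ B.T) @ A.T
```

With these illustrative sizes the factorization stores 4x fewer parameters; how much of the output `y_full` the low-rank product recovers depends entirely on how fast the singular values of `W` decay, which is why a uniform rank across layers (the failure mode the paper targets) can hurt badly on layers with slow spectral decay.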