🤖 AI Summary
Existing low-rank compression methods for large language models (LLMs) suffer substantial accuracy degradation at high compression ratios because aggressive rank reduction fails to balance efficiency and fidelity, which limits deployment on edge devices with constrained compute and memory. This paper proposes SkipCat, a low-rank compression framework that jointly optimizes intra-layer shared low-rank projection matrices and structured sub-block skipping to raise effective rank utilization without fine-tuning, integrating unsupervised low-rank decomposition with weight-activation co-compression. Experiments show that, at equivalent compression ratios, the method improves zero-shot task accuracy by 7% over state-of-the-art low-rank approaches, reduces inference GPU memory consumption and FLOPs by over 40%, and is the first to achieve this level of performance without any fine-tuning.
📝 Abstract
Large language models (LLMs) have achieved remarkable performance across a wide range of tasks. However, their substantial parameter sizes pose significant challenges for deployment on edge devices with limited computational and memory resources. Low-rank compression is a promising approach to address this issue, as it reduces both computational and memory costs, making LLMs more suitable for resource-constrained environments. Nonetheless, naïve low-rank compression methods require a significant reduction in the retained rank to achieve meaningful memory and computation savings: a factored matrix must keep fewer than half of its original ranks before the factorization yields any efficiency gain. Such aggressive truncation, however, typically results in substantial performance degradation. To address this trade-off, we propose SkipCat, a novel low-rank compression framework that enables the use of higher ranks while achieving the same compression rates. First, we introduce an intra-layer shared low-rank projection method, where multiple matrices that share the same input use a common projection. This reduces redundancy and improves compression efficiency. Second, we propose a block skipping technique that omits computations and memory transfers for selected sub-blocks within the low-rank decomposition. These two techniques jointly enable our compressed model to retain more effective ranks under the same compression budget. Experimental results show that, without any additional fine-tuning, our method outperforms previous low-rank compression approaches by 7% accuracy on zero-shot tasks at the same compression rate. These results highlight the effectiveness of our rank-maximized compression strategy in preserving model performance under tight resource constraints.
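To make the two mechanisms in the abstract concrete, below is a minimal PyTorch sketch of (1) an intra-layer shared low-rank projection for matrices that consume the same input, and (2) skipping selected sub-blocks inside the factorization. It is an illustration under assumptions, not the paper's implementation: the choice of Q/K/V as the matrices sharing an input, the module and attribute names (`SharedLowRankQKV`, `keep_mask`), and the way the rank dimension is chunked into sub-blocks are all assumptions. The final lines also show the parameter-count arithmetic behind the claim that the retained rank must fall below half before a factorization saves memory.

```python
# Sketch, assuming Q/K/V are the matrices that share an input: each weight is
# factored as W_m ≈ U_m @ B with a single down-projection B shared across Q/K/V,
# and sub-blocks of the rank dimension can be skipped per matrix.
import torch
import torch.nn as nn


class SharedLowRankQKV(nn.Module):
    """Illustrative shared low-rank Q/K/V projection with sub-block skipping.

    A plain rank-r factorization of three d×d weights stores 3 * (2*d*r)
    parameters; sharing B across Q/K/V stores 3*d*r + d*r = 4*d*r, so the
    same parameter budget affords a larger retained rank r.
    """

    def __init__(self, d_model: int, rank: int, n_blocks: int = 4):
        super().__init__()
        assert rank % n_blocks == 0
        self.n_blocks = n_blocks
        # Shared down-projection B: d_model -> rank (used by Q, K and V).
        self.B = nn.Linear(d_model, rank, bias=False)
        # Per-matrix up-projections U_q, U_k, U_v: rank -> d_model.
        self.U = nn.ModuleDict({
            m: nn.Linear(rank, d_model, bias=False) for m in ("q", "k", "v")
        })
        # Boolean mask over rank sub-blocks per matrix; a zeroed block is
        # skipped entirely (no compute and no memory traffic for its columns).
        self.register_buffer(
            "keep_mask", torch.ones(3, n_blocks, dtype=torch.bool)
        )

    def forward(self, x: torch.Tensor):
        z = self.B(x)                              # shared rank-r code
        z_blocks = z.chunk(self.n_blocks, dim=-1)  # split rank into sub-blocks
        outs = []
        for i, m in enumerate(("q", "k", "v")):
            # Columns of U_m grouped into the same sub-blocks as z.
            u_blocks = self.U[m].weight.chunk(self.n_blocks, dim=1)
            # Accumulate only the retained sub-blocks for this matrix.
            y = sum(
                zb @ ub.t()
                for zb, ub, keep in zip(z_blocks, u_blocks, self.keep_mask[i])
                if keep
            )
            outs.append(y)
        return tuple(outs)  # (q, k, v)


# Rank-budget arithmetic: one d×d weight factored at rank r stores 2*d*r
# parameters, so it only saves memory when r < d/2.
d, r = 4096, 1024
print("dense:", d * d, "low-rank:", 2 * d * r, "saves memory:", 2 * d * r < d * d)
```

In this sketch, dropping a sub-block for one of Q, K, or V removes both its slice of the up-projection weights and the corresponding matmul, which is the kind of saving the abstract attributes to block skipping; how SkipCat actually selects which sub-blocks to skip is not covered here.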