🤖 AI Summary
Existing LLM compression methods, particularly singular value decomposition (SVD)-based approaches, adopt uniform rank truncation across layers despite heterogeneous information density, leading to suboptimal accuracy–efficiency trade-offs.
Method: We propose D-Rank, a layer-wise dynamic rank allocation framework that uses effective rank to quantify the information density of each layer's weight matrices and employs Lagrangian optimization to assign ranks adaptively under a target compression ratio. D-Rank also introduces, for the first time, inter-layer dynamic rank rebalancing and a GQA-aware strategy for redistributing importance across attention layers.
Results: Evaluated on LLaMA models, D-Rank consistently outperforms SVD-LLM and other baselines: at a 20% compression ratio, C4 perplexity drops by more than 15 points; at 40%, zero-shot accuracy improves by up to 5%, while throughput remains higher.
📝 Abstract
Large language models (LLMs) have rapidly scaled in size, bringing severe memory and computational challenges that hinder their deployment. Singular Value Decomposition (SVD)-based compression has emerged as an appealing post-training compression technique for LLMs, yet most existing methods apply a uniform compression ratio across all layers, implicitly assuming that information is distributed homogeneously across layers. This overlooks the substantial inter-layer heterogeneity observed in LLMs, where middle layers tend to encode richer information while early and late layers are more redundant. In this work, we revisit existing SVD-based compression methods and propose D-Rank, a framework with layer-wise balanced Dynamic Rank allocation for LLM compression. We first introduce effective rank as a principled metric to measure the information density of weight matrices, and then allocate ranks via a Lagrange multiplier-based optimization scheme that adaptively assigns more capacity to groups with higher information density under a fixed compression ratio. Moreover, we rebalance the allocated ranks across attention layers to account for their varying importance and extend D-Rank to the latest LLMs with grouped-query attention (GQA). Extensive experiments on LLMs of different scales across multiple compression ratios demonstrate that D-Rank consistently outperforms SVD-LLM, ASVD, and Basis Sharing, achieving more than 15 lower perplexity with the LLaMA-3-8B model on the C4 dataset at a 20% compression ratio and up to 5% higher zero-shot reasoning accuracy with the LLaMA-7B model at a 40% compression ratio, while achieving even higher throughput.
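The abstract does not give the formula for effective rank, but the standard entropy-based definition (the exponential of the Shannon entropy of the normalized singular-value spectrum) matches the described use. The sketch below computes that metric and then splits a total rank budget across layers in proportion to it; note that the proportional split is only a simplified stand-in for the paper's Lagrange-multiplier optimization, and the function names are illustrative, not from the paper.

```python
import math

def effective_rank(singular_values):
    """Entropy-based effective rank: exp of the Shannon entropy of the
    normalized singular-value distribution. It equals the number of
    singular values when the spectrum is flat and approaches 1 as the
    spectrum concentrates in one direction."""
    total = sum(singular_values)
    probs = [s / total for s in singular_values if s > 0]
    entropy = -sum(p * math.log(p) for p in probs)
    return math.exp(entropy)

def allocate_ranks(eff_ranks, total_budget):
    """Toy allocation: divide a total rank budget across layers in
    proportion to each layer's effective rank. (D-Rank instead solves a
    constrained optimization via Lagrange multipliers; this proportional
    split only illustrates the idea of giving denser layers more rank.)"""
    s = sum(eff_ranks)
    return [max(1, round(total_budget * e / s)) for e in eff_ranks]

# A flat spectrum uses all directions equally...
flat = effective_rank([1.0, 1.0, 1.0, 1.0])      # ~ 4.0
# ...while a decaying spectrum concentrates information in few directions.
skewed = effective_rank([10.0, 1.0, 0.1, 0.01])  # well below 4
ranks = allocate_ranks([flat, skewed], total_budget=100)
print(flat, skewed, ranks)
```

Under this scheme, the layer with the flatter (information-denser) spectrum receives the larger share of the rank budget, mirroring the paper's intuition that middle layers with richer information should be truncated less aggressively.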