🤖 AI Summary
Existing LLM compression methods, particularly singular value decomposition (SVD)-based approaches, adopt uniform rank truncation across layers despite heterogeneous information density, leading to suboptimal accuracy–efficiency trade-offs.
Method: We propose D-Rank, a layer-wise dynamic rank allocation framework that uses effective rank to quantify the information density of each layer's weight matrices and employs Lagrangian optimization to assign ranks adaptively under a target compression ratio. D-Rank also introduces, for the first time, inter-layer dynamic rank rebalancing and a GQA-aware strategy for redistributing importance across attention layers.
Results: Evaluated on LLaMA models, D-Rank consistently outperforms SVD-LLM and other baselines: at a 20% compression ratio, C4 perplexity drops by more than 15 points; at 40%, zero-shot accuracy improves by up to 5%, while throughput remains higher.
📝 Abstract
Large language models (LLMs) have rapidly scaled in size, bringing severe memory and computational challenges that hinder their deployment. Singular Value Decomposition (SVD)-based compression has emerged as an appealing post-training compression technique for LLMs, yet most existing methods apply a uniform compression ratio across all layers, implicitly assuming that information is distributed homogeneously across layers. This overlooks the substantial inter-layer heterogeneity observed in LLMs, where middle layers tend to encode richer information while early and late layers are more redundant. In this work, we revisit existing SVD-based compression methods and propose D-Rank, a framework with layer-wise balanced Dynamic Rank allocation for LLM compression. We first introduce effective rank as a principled metric to measure the information density of weight matrices, and then allocate ranks via a Lagrange multiplier-based optimization scheme that adaptively assigns more capacity to groups with higher information density under a fixed compression ratio. Moreover, we rebalance the allocated ranks across attention layers to account for their varying importance and extend D-Rank to the latest LLMs with grouped-query attention (GQA). Extensive experiments on LLMs of different scales across multiple compression ratios demonstrate that D-Rank consistently outperforms SVD-LLM, ASVD, and Basis Sharing, achieving more than 15 lower perplexity with the LLaMA-3-8B model on the C4 dataset at a 20% compression ratio and up to 5% higher zero-shot reasoning accuracy with the LLaMA-7B model at a 40% compression ratio, while achieving even higher throughput.
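The abstract does not give the formula for effective rank, but the standard entropy-based definition (the exponential of the Shannon entropy of the normalized singular-value spectrum) matches the described use. The sketch below computes that metric and then splits a total rank budget across layers in proportion to it; note that the proportional split is only a simplified stand-in for the paper's Lagrange-multiplier optimization, and the function names are illustrative, not from the paper.

```python
import math

def effective_rank(singular_values):
    """Entropy-based effective rank: exp of the Shannon entropy of the
    normalized singular-value distribution. It equals the number of
    singular values when the spectrum is flat and approaches 1 as the
    spectrum concentrates in one direction."""
    total = sum(singular_values)
    probs = [s / total for s in singular_values if s > 0]
    entropy = -sum(p * math.log(p) for p in probs)
    return math.exp(entropy)

def allocate_ranks(eff_ranks, total_budget):
    """Toy allocation: divide a total rank budget across layers in
    proportion to each layer's effective rank. (D-Rank instead solves a
    constrained optimization via Lagrange multipliers; this proportional
    split only illustrates the idea of giving denser layers more rank.)"""
    s = sum(eff_ranks)
    return [max(1, round(total_budget * e / s)) for e in eff_ranks]

# A flat spectrum uses all directions equally...
flat = effective_rank([1.0, 1.0, 1.0, 1.0])      # ~ 4.0
# ...while a decaying spectrum concentrates information in few directions.
skewed = effective_rank([10.0, 1.0, 0.1, 0.01])  # well below 4
ranks = allocate_ranks([flat, skewed], total_budget=100)
print(flat, skewed, ranks)
```

Under this scheme, the layer with the flatter (information-denser) spectrum receives the larger share of the rank budget, mirroring the paper's intuition that middle layers with richer information should be truncated less aggressively.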