🤖 AI Summary
This work addresses the challenge of jointly optimizing layer-wise ranks for low-rank compression of large language models (LLMs) without fine-tuning. We propose LLRC, a differentiable rank selection method that introduces learnable singular value masks and minimizes the discrepancy between the intermediate activations of the compressed and original models on a calibration set, end-to-end and in a fully differentiable manner. By combining a continuous relaxation in the singular value space, coupled singular value decomposition (SVD), and gradient backpropagation, LLRC enables fine-grained, gradient-driven rank optimization per layer. Compared to fine-tuning-free baselines, including STRS, SVD-LLM, and LLM-Pruner, LLRC achieves up to a 12% accuracy gain on MMLU at a 20% compression rate on Llama-2-13B, and it performs competitively with fine-tuning-based methods such as the fine-tuning variant of LLM-Pruner. To our knowledge, this is the first gradient-based rank selection approach to improve the accuracy-compression trade-off without any parameter updates or post-compression fine-tuning.
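To make the mask mechanism concrete, the minimal sketch below shows one way a frozen linear layer could gate its singular values with learnable mask logits, using a sigmoid relaxation as an illustrative stand-in for the continuous relaxation described above. The class name, temperature parameter, and PyTorch interface are assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn as nn

class MaskedSVDLinear(nn.Module):
    """A frozen linear layer whose singular values are gated by learnable
    mask logits (illustrative sketch; names and relaxation are assumptions)."""

    def __init__(self, weight: torch.Tensor, temperature: float = 1.0):
        super().__init__()
        # SVD of the original (frozen) weight matrix: weight = U @ diag(S) @ Vh.
        U, S, Vh = torch.linalg.svd(weight, full_matrices=False)
        self.register_buffer("U", U)
        self.register_buffer("S", S)
        self.register_buffer("Vh", Vh)
        # One learnable logit per singular value; sigmoid(logit / T) is a
        # continuous relaxation of the binary keep/drop decision.
        self.mask_logits = nn.Parameter(torch.zeros_like(S))
        self.temperature = temperature

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        mask = torch.sigmoid(self.mask_logits / self.temperature)
        # Reconstruct the (softly) rank-reduced weight and apply it.
        w = self.U @ torch.diag(self.S * mask) @ self.Vh
        return x @ w.T
```

After training, singular values whose mask falls below a threshold would be dropped, and the layer stored as two low-rank factors to realise the compression.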
📝 Abstract
Approaches for compressing large language models using low-rank decomposition have made strides, particularly with the introduction of activation- and loss-aware SVD, which improves the trade-off between decomposition rank and downstream task performance. Despite these advancements, a persistent challenge remains: selecting the optimal ranks for each layer to jointly optimise compression rate and downstream task accuracy. Current methods either rely on heuristics that can yield sub-optimal results due to their limited discrete search space, or are gradient-based but, without post-compression fine-tuning, are less performant than heuristic approaches. To address these issues, we propose Learning to Low-Rank Compress (LLRC), a gradient-based approach that directly learns the weights of masks that select singular values in a fine-tuning-free setting. Using a calibration dataset, we train only the mask weights to select progressively fewer singular values while minimising the divergence of intermediate activations from the original model. Our approach outperforms competing rank selection methods that similarly require no post-compression fine-tuning across various compression rates on common-sense reasoning and open-domain question-answering tasks. For instance, at a compression rate of 20% on Llama-2-13B, LLRC outperforms the competitive Sensitivity-based Truncation Rank Searching (STRS) on MMLU, BoolQ, and OpenbookQA by 12%, 3.5%, and 4.4%, respectively. Compared to other compression techniques, our approach consistently outperforms the fine-tuning-free variants of SVD-LLM and LLM-Pruner across datasets and compression rates. Our fine-tuning-free approach also performs competitively with the fine-tuning variant of LLM-Pruner.
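As an illustration of the calibration-only training described above, the sketch below shows a single optimisation step that updates only the mask logits: it keeps the compressed model's intermediate activations close to the original model's and adds a penalty that pushes the masks toward selecting fewer singular values. The `output_hidden_states` interface, the MSE activation loss, and the `sparsity_weight` knob are assumptions for illustration; the paper's exact objective may differ.

```python
import torch
import torch.nn.functional as F

def mask_training_step(original_model, compressed_model, batch, optimizer,
                       sparsity_weight: float = 1.0) -> float:
    """One calibration step that trains only the singular value mask logits
    (illustrative sketch; the exact loss and interfaces are assumptions)."""
    # Intermediate activations of the frozen original model (no gradients).
    with torch.no_grad():
        ref = original_model(**batch, output_hidden_states=True).hidden_states
    out = compressed_model(**batch, output_hidden_states=True).hidden_states

    # Keep the compressed model's intermediate activations close to the original's.
    distill_loss = sum(F.mse_loss(o, r) for o, r in zip(out, ref))

    # Push the masks toward selecting fewer singular values, i.e. lower ranks.
    mask_loss = sum(
        torch.sigmoid(m.mask_logits).mean()
        for m in compressed_model.modules()
        if hasattr(m, "mask_logits")
    )

    loss = distill_loss + sparsity_weight * mask_loss
    optimizer.zero_grad()
    loss.backward()  # assumes all weights are frozen and the optimizer holds only mask logits
    optimizer.step()
    return loss.item()
```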