Optimal Singular Damage: Efficient LLM Inference in Low Storage Regimes

📅 2025-11-04
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the high storage overhead of incremental parameters in fine-tuned large language models (LLMs), this paper proposes an efficient compression method integrating low-rank approximation with structured sparsity. The core innovation is the “optimal singular damage” strategy: leveraging interleaved importance scores of left and right singular vectors, it selectively prunes redundant components from low-rank updates, prioritizing retention of the most expressive parameter subspaces under a fixed memory budget. The method combines singular value decomposition (SVD) with importance-aware sparsification to yield a compact, high-fidelity approximation of fine-tuning deltas. Experiments demonstrate that, under identical storage constraints, the proposed approach significantly outperforms baselines relying solely on low-rank or sparse representations, achieving higher inference accuracy and superior compression efficiency, and thereby establishing a more favorable accuracy–storage trade-off.
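The pipeline described above — low-rank approximation of a fine-tuning delta followed by importance-aware sparsification under a budget — can be sketched as follows. This is illustrative code, not the authors' implementation: the function name `sparsified_low_rank` is made up, and the magnitude-based entry scoring is a simple stand-in for the paper's interleaved singular-vector importance scores.

```python
# Hypothetical sketch: sparsify a low-rank approximation of a
# fine-tuning delta (delta = W_finetuned - W_pretrained) so that only
# `budget` entries survive. Not the paper's actual algorithm.
import numpy as np

def sparsified_low_rank(delta, rank, budget):
    """Rank-`rank` SVD approximation of `delta`, then zero all but the
    `budget` largest-magnitude entries of the reconstruction."""
    U, S, Vt = np.linalg.svd(delta, full_matrices=False)
    approx = U[:, :rank] @ np.diag(S[:rank]) @ Vt[:rank, :]
    # Importance proxy: absolute magnitude of each reconstructed entry.
    # The paper instead scores components via the interleaved importance
    # of left and right singular vectors.
    flat = np.abs(approx).ravel()
    threshold = np.sort(flat)[-budget]  # budget-th largest magnitude
    mask = np.abs(approx) >= threshold
    return approx * mask
```

The point of the higher rank is that, even after sparsification, the retained entries can draw on more singular directions than a dense lower-rank approximation of the same storage cost.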

📝 Abstract
Large language models (LLMs) are increasingly prevalent across diverse applications. However, their enormous size limits storage and processing capabilities to a few well-resourced stakeholders. As a result, most applications rely on pre-trained LLMs, fine-tuned for specific tasks. Yet even storing the fine-tuned versions of these models remains a significant challenge due to the wide range of tasks they address. Recent studies show that fine-tuning these models primarily affects a small fraction of parameters, highlighting the need for more efficient storage of fine-tuned models. This paper focuses on efficient storage of parameter updates in pre-trained models after fine-tuning. To address this challenge, we leverage the observation that fine-tuning updates are both low-rank and sparse, which can be exploited for storage efficiency. However, using only low-rank approximation or sparsification may discard critical singular components that enhance model expressivity. We first observe that, given the same memory budget, sparsified low-rank approximations with larger ranks outperform standard low-rank approximations with smaller ranks. Building on this, we propose our method, optimal singular damage, which selectively sparsifies low-rank approximated updates by leveraging the interleaved importance of singular vectors, ensuring that the most impactful components are retained. We demonstrate through extensive experiments that our proposed method leads to significant storage efficiency and superior accuracy within the same memory budget compared to employing low-rank approximation or sparsification individually.
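The abstract's key observation — that a sparsified higher-rank update can outperform a dense lower-rank one at equal memory — rests on simple storage accounting. The sketch below makes that arithmetic concrete; the helper names and the one-unit-per-entry index overhead for sparse storage are assumptions for illustration, not figures from the paper.

```python
# Hypothetical storage-budget accounting for a fine-tuning delta on an
# m x n weight matrix (helper names and index overhead are assumed).

def low_rank_params(m, n, r):
    """Units to store dense rank-r factors: U is m x r, V is r x n."""
    return r * (m + n)

def sparse_low_rank_params(m, n, r, keep_frac, index_overhead=1.0):
    """Units when only keep_frac of the factor entries are kept, each
    paying `index_overhead` extra units to record its position."""
    return low_rank_params(m, n, r) * keep_frac * (1 + index_overhead)

# For a 1024x1024 layer: a dense rank-8 update costs the same as a
# rank-16 update that keeps 25% of entries (at 1 index unit per entry),
# yet the latter can draw on twice as many singular directions.
dense = low_rank_params(1024, 1024, 8)
sparse = sparse_low_rank_params(1024, 1024, 16, 0.25)
```

Under this accounting, the sparse variant spends the same budget but spreads it across a richer subspace — which is exactly the regime where the paper reports accuracy gains.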
Problem

Research questions and friction points this paper is trying to address.

Efficient storage of fine-tuned LLM parameter updates
Retaining critical singular components in low-rank approximations
Optimizing memory usage while maintaining model accuracy
Innovation

Methods, ideas, or system contributions that make the work stand out.

Selectively sparsifies low-rank approximated parameter updates
Retains most impactful singular vector components efficiently
Combines sparsification and low-rank approximation for storage optimization
Mohammadsajad Alipour
Department of Computer Science, Rensselaer Polytechnic Institute, Troy, NY 12180, USA
Mohammad Mohammadi Amiri
Assistant Professor at Rensselaer Polytechnic Institute
Machine Learning · Data Science · Optimization · Information Theory · Wireless Communications