D-QRELO: Training- and Data-Free Delta Compression for Large Language Models via Quantization and Residual Low-Rank Approximation

📅 2026-04-18

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the high memory overhead of storing many fine-tuned models under large-scale supervised fine-tuning, a setting in which existing delta compression techniques fail, by proposing a training- and data-free delta compression method. The approach combines coarse-grained one-bit quantization, which captures the dominant structure of the delta weights, with a compensated residual low-rank approximation that recovers fine-grained details from the remaining error. A systematic analysis of how task difficulty, model architecture, and layer position influence compression performance yields design principles for delta compression. Experiments show that the proposed method outperforms existing methods across dense and mixture-of-experts (MoE) large language models in multiple domains.

📝 Abstract

Supervised Fine-Tuning (SFT) accelerates the development of task-specific large language models (LLMs), but the resulting proliferation of fine-tuned models incurs substantial memory overhead. Delta compression addresses this by retaining a single pre-trained LLM together with multiple compressed delta weights. However, existing methods fail on models fine-tuned with large-scale datasets. We find that a larger SFT data scale amplifies delta parameter magnitude, singular values, and entropy, exacerbating compression errors. To tackle this, we propose D-QRELO (Delta Compression via Quantization and Residual Low-Rank), a novel training- and data-free delta compression method. It combines coarse-grained one-bit quantization, which captures the dominant structure of the delta, with a compensated residual low-rank approximation that recovers fine-grained details from the smaller residual error. Experiments on various LLMs spanning dense and MoE architectures across multiple domains under this challenging setting demonstrate that D-QRELO outperforms existing methods. Moreover, we establish key design principles for delta compression through extensive empirical analysis, demonstrating how task difficulty, architecture, and layer positioning create predictable patterns that can guide optimal compression strategies in production systems.
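
The two-stage idea in the abstract (one-bit quantization of the delta, then a low-rank approximation of what the quantizer missed) can be illustrated with a small NumPy sketch. This is not the paper's implementation: the mean-absolute-value scale, the plain truncated SVD, the rank of 8, and all function names are assumptions chosen for illustration, and the paper's "compensated" residual step may differ in detail.

```python
import numpy as np

def one_bit_quantize(delta):
    """Assumed quantizer: keep only the sign of each entry plus one shared scale."""
    scale = np.abs(delta).mean()               # single per-matrix scale (assumption)
    sign = np.where(delta >= 0, 1.0, -1.0)     # one bit of information per weight
    return sign, scale

def low_rank(residual, rank):
    """Truncated SVD: best rank-`rank` approximation in Frobenius norm."""
    u, s, vt = np.linalg.svd(residual, full_matrices=False)
    return u[:, :rank] * s[:rank], vt[:rank]   # factors A (m x r) and B (r x n)

def compress_delta(w_base, w_ft, rank):
    """Compress the delta between a fine-tuned and a base weight matrix."""
    delta = w_ft - w_base
    sign, scale = one_bit_quantize(delta)
    residual = delta - scale * sign            # error left by the one-bit stage
    a, b = low_rank(residual, rank)
    return sign, scale, a, b

def reconstruct(w_base, sign, scale, a, b):
    """Rebuild an approximation of the fine-tuned weights."""
    return w_base + scale * sign + a @ b

# Toy example with synthetic weights (no real model involved).
rng = np.random.default_rng(0)
w_base = rng.normal(size=(64, 48))
w_ft = w_base + 0.05 * rng.normal(size=(64, 48))

sign, scale, a, b = compress_delta(w_base, w_ft, rank=8)
err_quant = np.linalg.norm(w_ft - (w_base + scale * sign))
err_full = np.linalg.norm(w_ft - reconstruct(w_base, sign, scale, a, b))
print(f"quantization only: {err_quant:.4f}, with low-rank residual: {err_full:.4f}")
```

Because the truncated SVD never increases the Frobenius norm of the error it approximates, adding the low-rank term cannot make the reconstruction worse than quantization alone in this sketch; how much it helps depends on the residual's singular-value spectrum and the chosen rank. Neither stage needs training data or gradient updates, which is the sense in which such a scheme is training- and data-free.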

Problem

Research questions and friction points this paper is trying to address.

delta compression

large language models

supervised fine-tuning

memory overhead

compression error

Innovation

Methods, ideas, or system contributions that make the work stand out.

Delta Compression

Quantization

Low-Rank Approximation