🤖 AI Summary
Multi-task model merging faces scalability challenges due to the prohibitive memory overhead of storing numerous task-specific checkpoints, especially for large models and diverse tasks. This work proposes Task Vector Quantization (TVQ), a paradigm that quantizes only lightweight task vectors instead of full fine-tuned models, drastically reducing storage requirements. The key contributions are: (1) residual task vector quantization, which decomposes each task vector into hierarchical residuals and quantizes them progressively; and (2) a sensitivity-aware dynamic bit allocation mechanism that mitigates error accumulation under ultra-low-precision (≤2-bit) quantization. Evaluated on image classification and dense prediction tasks, TVQ maintains or even improves merged-model performance while reducing the memory footprint to just 8% of that required by full-precision checkpoints. To the authors' knowledge, this is the first method enabling efficient, high-fidelity multi-task model merging at extremely low bit widths.
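To make the residual decomposition concrete, here is a minimal PyTorch sketch. The uniform symmetric quantizer and the helper names (`quantize_uniform`, `residual_quantize_task_vector`, `base_bits`, `offset_bits`) are our illustrative assumptions, not the paper's implementation:

```python
import torch

def quantize_uniform(t: torch.Tensor, bits: int) -> torch.Tensor:
    """Uniform symmetric quantize->dequantize round trip at `bits` bits.
    Illustrative stand-in for the paper's quantizer; assumes bits >= 2."""
    qmax = 2 ** (bits - 1) - 1            # e.g., 1 for 2-bit, 7 for 4-bit
    scale = t.abs().max().item() / qmax
    if scale == 0.0:                      # all-zero tensor: nothing to quantize
        return torch.zeros_like(t)
    q = torch.clamp(torch.round(t / scale), -qmax - 1, qmax)
    return q * scale

def residual_quantize_task_vector(theta_ft: torch.Tensor,
                                  theta_pre: torch.Tensor,
                                  base_bits: int = 2,
                                  offset_bits: int = 2) -> torch.Tensor:
    """Decompose the task vector into a quantized base plus a quantized
    offset that captures the residual left over by the first pass."""
    tau = theta_ft - theta_pre            # task vector: narrow weight range
    base = quantize_uniform(tau, base_bits)
    offset = quantize_uniform(tau - base, offset_bits)
    return base + offset                  # low-bit approximation of tau
```

Because the second pass operates on the much smaller residual range, two ultra-low-bit passes can approximate the task vector more closely than a single pass at the same total bit budget.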
📝 Abstract
Model merging enables efficient multi-task models by combining task-specific fine-tuned checkpoints. However, storing multiple task-specific checkpoints requires significant memory, which limits scalability and prevents model merging from extending to larger models and more diverse tasks. In this paper, we propose quantizing task vectors (i.e., the difference between pre-trained and fine-tuned checkpoints) instead of quantizing fine-tuned checkpoints. We observe that task vectors exhibit a narrow weight range, enabling low-precision quantization (down to 4 bits) within existing task vector merging frameworks. To further mitigate quantization errors at ultra-low precision (e.g., 2-bit), we introduce Residual Task Vector Quantization, which decomposes each task vector into a base vector and an offset component. We then allocate bits according to quantization sensitivity, preserving precision for sensitive components while minimizing overall error within a given memory budget. Experiments on image classification and dense prediction show that our method maintains or improves model merging performance while using only 8% of the memory required for full-precision checkpoints.
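The abstract does not spell out the allocation procedure, so below is a hypothetical greedy allocator to illustrate sensitivity-aware budgeting. The sensitivity proxy (per-parameter drop in squared quantization error), the candidate widths in `choices`, and the names `quant_error` and `allocate_bits` are all our assumptions; the paper's actual sensitivity measure and procedure may differ:

```python
import torch

def quant_error(t: torch.Tensor, bits: int) -> float:
    """Squared error of a uniform symmetric quantize->dequantize pass."""
    qmax = 2 ** (bits - 1) - 1
    scale = t.abs().max().item() / qmax
    if scale == 0.0:
        return 0.0
    q = torch.clamp(torch.round(t / scale), -qmax - 1, qmax) * scale
    return torch.sum((t - q) ** 2).item()

def allocate_bits(task_vectors, budget_bits: int, choices=(2, 3, 4)):
    """Greedy sensitivity-aware allocation: start every task vector at the
    lowest width, then repeatedly grant one extra bit to the vector whose
    quantization error drops the most per bit of budget spent."""
    widths = [min(choices)] * len(task_vectors)
    sizes = [tv.numel() for tv in task_vectors]
    used = sum(w * n for w, n in zip(widths, sizes))
    while True:
        best, best_gain = None, 0.0
        for i, tv in enumerate(task_vectors):
            if widths[i] >= max(choices) or used + sizes[i] > budget_bits:
                continue                  # at max width or over budget
            gain = (quant_error(tv, widths[i])
                    - quant_error(tv, widths[i] + 1)) / sizes[i]
            if gain > best_gain:
                best, best_gain = i, gain
        if best is None:
            return widths                 # per-task widths, e.g. [2, 4, 2, 3]
        widths[best] += 1
        used += sizes[best]
```

For intuition on the headline figure: a 2-bit task vector needs 1/16 of the per-parameter memory of a 32-bit checkpoint (6.25%), roughly in line with the reported 8% once quantization metadata such as scales is included.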