QSVD: Efficient Low-rank Approximation for Unified Query-Key-Value Weight Compression in Low-Precision Vision-Language Models

📅 2025-10-17
🤖 AI Summary
Vision-Language Models (VLMs) suffer from high memory footprints, prolonged inference latency, and substantial KV cache overhead due to the large size of their QKV weight matrices. To address these challenges, this paper proposes a joint low-rank approximation and quantization compression framework. The method unifies the singular value decomposition (SVD) of the QKV weights and introduces a dynamic rank allocation strategy that preserves semantically critical channels while achieving parameter-efficient compression. Furthermore, it co-optimizes low-precision quantization of both weights and activations. Experiments demonstrate that the approach significantly reduces KV cache size and computational cost on edge devices, cutting hardware resource consumption by over 30%, while improving accuracy by more than 10% relative to state-of-the-art compression methods. This yields markedly enhanced real-time inference capability and deployment efficiency for VLMs in resource-constrained scenarios.

📝 Abstract
Vision-Language Models (VLMs) are integral to tasks such as image captioning and visual question answering, but their high computational cost, driven by large memory footprints and processing time, limits their scalability and real-time applicability. In this work, we propose leveraging Singular-Value Decomposition (SVD) over the joint query (Q), key (K), and value (V) weight matrices to reduce KV cache size and computational overhead. In addition, we introduce an efficient rank allocation strategy that dynamically adjusts the SVD rank based on its impact on VLM accuracy, achieving a significant reduction in both memory usage and computational cost. Finally, we extend this approach by applying quantization to both VLM weights and activations, resulting in a highly efficient VLM. Our method outperforms previous approaches that rely solely on quantization or SVD, achieving more than 10% accuracy improvement at lower hardware cost and making it better suited for real-time deployment on resource-constrained devices. We open source our code at https://github.com/SAI-Lab-NYU/QSVD.
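The central idea of the abstract, factoring the concatenated Q, K, and V projection matrices with a single truncated SVD, can be sketched as follows. The dimensions and the fixed truncation rank below are illustrative assumptions, not the paper's settings (QSVD allocates ranks dynamically based on accuracy impact):

```python
import numpy as np

# Illustrative sizes only: hidden dim d and a fixed truncation rank.
d, rank = 64, 16

rng = np.random.default_rng(0)
W_q, W_k, W_v = (rng.standard_normal((d, d)) for _ in range(3))

# Joint SVD: stack [W_q; W_k; W_v] and factor once, so all three
# projections share a single low-rank input basis B.
W_qkv = np.concatenate([W_q, W_k, W_v], axis=0)      # shape (3d, d)
U, S, Vt = np.linalg.svd(W_qkv, full_matrices=False)

A = U[:, :rank] * S[:rank]                           # (3d, rank)
B = Vt[:rank, :]                                     # (rank, d)
W_approx = A @ B                                     # best rank-`rank` fit

# Storage drops from 3*d*d parameters to rank*(3d + d).
orig_params = 3 * d * d
lowrank_params = rank * (3 * d + d)

# Relative reconstruction error of the truncation.
rel_err = np.linalg.norm(W_qkv - W_approx) / np.linalg.norm(W_qkv)
```

Because the truncated factor `B` maps inputs into a shared `rank`-dimensional space before the Q/K/V projections, the same factorization is what enables the reduced KV cache described above.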
Problem

Research questions and friction points this paper is trying to address.

Reducing computational cost in Vision-Language Models
Compressing query-key-value weights via SVD and quantization
Improving efficiency for deployment on resource-constrained devices
Innovation

Methods, ideas, or system contributions that make the work stand out.

Joint SVD compression of query, key, value weight matrices
Dynamic rank allocation strategy for accuracy preservation
Quantization applied to both weights and activations
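The last contribution, quantizing both weights and activations, can be illustrated with a minimal per-tensor symmetric int8 scheme. This is a generic sketch under assumed bit-widths and tensor shapes, not the paper's exact quantizer:

```python
import numpy as np

def quantize_symmetric(x, bits=8):
    """Uniform symmetric quantization: map x to signed ints plus a scale."""
    qmax = 2 ** (bits - 1) - 1
    amax = np.abs(x).max()
    scale = amax / qmax if amax > 0 else 1.0
    q = np.clip(np.round(x / scale), -qmax - 1, qmax)
    return q.astype(np.int8), scale

rng = np.random.default_rng(1)
W = rng.standard_normal((64, 64)).astype(np.float32)   # weights
X = rng.standard_normal((8, 64)).astype(np.float32)    # activations

Wq, w_scale = quantize_symmetric(W)
Xq, x_scale = quantize_symmetric(X)

# Integer matmul, then dequantize with the product of the two scales.
Y_int = Xq.astype(np.int32) @ Wq.T.astype(np.int32)
Y = Y_int * (w_scale * x_scale)

# Compare against the full-precision result.
Y_fp = X @ W.T
rel_err = np.linalg.norm(Y - Y_fp) / np.linalg.norm(Y_fp)
```

In QSVD this quantization is co-optimized with the low-rank factors rather than applied to the dense weights as shown here; the sketch only demonstrates why integer storage and arithmetic reduce memory and compute cost.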
Yutong Wang
Tandon School of Engineering, New York University
Haiyu Wang
Tandon School of Engineering, New York University
Sai Qian Zhang
New York University