🤖 AI Summary
This work addresses the severe performance degradation of large multimodal models (LMMs) under ultra-low-bit quantization (e.g., 2-bit). We first theoretically establish a tight upper bound linking the intrinsic sparsity of attention matrices to compression errors in Query/Key weight quantization. Building on this insight, we propose a data-aware, hierarchical low-rank decomposition framework coupled with cross-layer adaptive bit allocation: a unified compression paradigm compatible with state-of-the-art 2-bit methods such as AQLM and QuIP#. Evaluated on image-language and video-language benchmarks, our approach achieves an average accuracy improvement of 21% over prior 2-bit baselines, significantly enhancing both the inference accuracy and efficiency of LMMs at ultra-low bitwidths.
📝 Abstract
In this work, we propose an extreme compression technique for Large Multimodal Models (LMMs). While previous studies have explored quantization as an efficient post-training compression method for Large Language Models (LLMs), low-bit compression for multimodal models remains under-explored. The redundant nature of inputs in multimodal models results in a highly sparse attention matrix. We theoretically and experimentally demonstrate that the attention matrix's sparsity bounds the compression error of the Query and Key weight matrices. Based on this, we introduce CASP, a model compression technique for LMMs. Our approach performs a data-aware low-rank decomposition on the Query and Key weight matrices, followed by quantization across all layers based on an optimal bit allocation process. CASP is compatible with any quantization technique and enhances state-of-the-art 2-bit quantization methods (AQLM and QuIP#) by an average of 21% on image- and video-language benchmarks.
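The first stage of the pipeline above, low-rank decomposition of the Query/Key weight matrices, can be sketched as follows. This is an illustrative stand-in, not CASP's actual algorithm: the paper's decomposition is data-aware (informed by activation statistics) and paired with a bit-allocation step, whereas here we use plain truncated SVD on a toy weight matrix to show the shape and reconstruction-error bookkeeping. The function name `low_rank_factorize` and all dimensions are hypothetical.

```python
import numpy as np

def low_rank_factorize(W: np.ndarray, rank: int):
    """Approximate W ≈ A @ B via truncated SVD.

    Sketch only: CASP's decomposition is data-aware, while this
    uses the plain SVD optimum in the Frobenius norm.
    """
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    A = U[:, :rank] * S[:rank]   # (d_out, rank), singular values folded in
    B = Vt[:rank, :]             # (rank, d_in)
    return A, B

# Toy "Query" weight matrix standing in for a real LMM layer.
rng = np.random.default_rng(0)
W_q = rng.standard_normal((64, 64))

A, B = low_rank_factorize(W_q, rank=8)
rel_err = np.linalg.norm(W_q - A @ B) / np.linalg.norm(W_q)
```

Storing `A` (64×8) and `B` (8×64) instead of `W_q` (64×64) already reduces the parameter count fourfold here; in the full method, these factors would then be quantized at per-layer bitwidths chosen by the optimal bit allocation process.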