CAR-SAM: Cross-Attention Reconstruction for Post-Training Quantization of the Segment Anything Model

📅 2026-05-16

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing post-training quantization (PTQ) methods for the Segment Anything Model (SAM) overlook the cross-attention structure in its decoder. This leads to attention dissipation and unstable reconstruction optimization, which severely degrade segmentation performance at low bit-widths. To address this, the work proposes CAR-SAM, the first quantization framework specifically tailored to SAM’s cross-attention architecture. It introduces a MatMul-aware compensation mechanism to mitigate attention dissipation and a joint cross-attention reconstruction strategy to stabilize decoder optimization. The method quantizes both SAM-B and SAM-L to 4 bits, improving mAP over current state-of-the-art approaches by 14.6% and 6.6%, respectively.

📝 Abstract

Segment Anything Models (SAMs) are extensively used in computer vision for universal image segmentation, but deploying them on resource-constrained devices is challenging due to their high computational and memory demands. Post-Training Quantization (PTQ) is a widely used technique for model compression and acceleration. However, existing PTQ methods fail to consider the cross-attention architecture in the SAM decoder, and segmentation performance degrades severely at low bit-widths. This degradation primarily stems from two challenges unique to SAMs: (1) Attention dissipation, where the attention information in the decoder, which is crucial for representing segmentation masks, collapses into a diffuse and non-semantic form under low-bit quantization; and (2) Reconstruction oscillation, where bidirectional coupling within the two-way transformer introduces cross-branch error interference and destabilizes convergence. To tackle these issues, we propose CAR-SAM, a unified quantization framework tailored for SAMs. First, to mitigate attention dissipation, we introduce a MatMul-Aware Compensation (MAC) mechanism that transfers activation-induced quantization errors from the MatMul to the preceding linear weights. Second, to mitigate oscillation in decoder optimization, we develop a Joint Cross-Attention Reconstruction (JCAR) strategy that jointly reconstructs the coupled attention branches, suppressing oscillatory behavior and promoting stable convergence. Extensive experiments show that CAR-SAM robustly quantizes SAM models down to 4-bit precision, surpassing existing methods by 14.6% and 6.6% mAP on SAM-B and SAM-L, respectively.
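
To make the low-bit setting concrete, the following is a minimal sketch of how quantizing the inputs of an attention MatMul perturbs the resulting attention weights. It is not CAR-SAM's MAC or JCAR. It assumes a plain uniform min-max quantizer applied per vector, a single attention row with random Gaussian queries and keys, and hypothetical helper names (`fake_quantize`, `attention_row`, `l1_distance`) chosen only for this illustration.

```python
import math
import random


def fake_quantize(values, num_bits):
    """Uniform min-max quantization followed by dequantization (assumed scheme)."""
    levels = 2 ** num_bits - 1
    lo, hi = min(values), max(values)
    scale = (hi - lo) / levels if hi > lo else 1.0
    # Snap each value to the nearest of the 2**num_bits grid points.
    return [lo + round((v - lo) / scale) * scale for v in values]


def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_row(query, keys, num_bits=None):
    """One row of softmax(q K^T / sqrt(d)), optionally with fake-quantized inputs."""
    if num_bits is not None:
        query = fake_quantize(query, num_bits)
        keys = [fake_quantize(k, num_bits) for k in keys]
    d = len(query)
    scores = [sum(a * b for a, b in zip(query, k)) / math.sqrt(d) for k in keys]
    return softmax(scores)


def l1_distance(p, q):
    return sum(abs(a - b) for a, b in zip(p, q))


random.seed(0)
dim, num_keys = 16, 8
query = [random.gauss(0, 1) for _ in range(dim)]
keys = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(num_keys)]

reference = attention_row(query, keys)  # full-precision attention weights
err_8bit = l1_distance(reference, attention_row(query, keys, num_bits=8))
err_4bit = l1_distance(reference, attention_row(query, keys, num_bits=4))
print(f"L1 error of attention row: 8-bit={err_8bit:.4f}, 4-bit={err_4bit:.4f}")
```

In this toy setup the 4-bit grid has only 16 levels per vector, so the attention weights drift further from the full-precision reference than with 8 bits. The abstract's claim is that in SAM's decoder this kind of drift destroys the mask-relevant attention structure, which is the error MAC is designed to move into the preceding linear weights.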

Problem

Research questions and friction points this paper is trying to address.

Post-Training Quantization

Segment Anything Model

Cross-Attention

Attention Dissipation

Reconstruction Oscillation

Innovation

Methods, ideas, or system contributions that make the work stand out.

Cross-Attention Reconstruction

Post-Training Quantization

Segment Anything Model