SAQ-SAM: Semantically-Aligned Quantization for Segment Anything Model

📅 2025-03-09
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To address the challenge of applying post-training quantization (PTQ) to the Segment Anything Model (SAM) on edge devices—hindered by its high computational overhead—this work proposes a semantics-aligned PTQ framework that overcomes two critical bottlenecks: anomalous attention behavior in the mask decoder and distorted prompt-visual interaction. The method introduces perceptual-consistency clipping, which uses attention focus overlap as the clipping metric, and prompt-aware reconstruction, which explicitly models cross-attention responses to align visual tokens with prompts. Additionally, a visual token layer-skipping strategy accelerates inference. The framework jointly aligns both distributional and semantic representations while incorporating aggressive clipping and layer skipping for end-to-end optimization. On SAM-B, the 4-bit quantized model achieves a +11.7% improvement in instance segmentation mAP. Extensive evaluations across multiple SAM variants—including diverse tasks and input scales—demonstrate consistent superiority over state-of-the-art PTQ methods.

📝 Abstract
Segment Anything Model (SAM) exhibits remarkable zero-shot segmentation capability; however, its prohibitive computational costs make edge deployment challenging. Although post-training quantization (PTQ) offers a promising compression solution, existing methods yield unsatisfactory results when applied to SAM, owing to its specialized model components and promptable workflow: (i) The mask decoder's attention exhibits extreme outliers, and we find that aggressive clipping (ranging down to even $100\times$), instead of smoothing or isolation, is effective in suppressing outliers while maintaining semantic capabilities. Unfortunately, traditional metrics (e.g., MSE) fail to provide such large-scale clipping. (ii) Existing reconstruction methods potentially neglect prompts' intention, resulting in distorted visual encodings during prompt interactions. To address the above issues, we propose SAQ-SAM in this paper, which boosts PTQ of SAM with semantic alignment. Specifically, we propose Perceptual-Consistency Clipping, which exploits attention focus overlap as clipping metric, to significantly suppress outliers. Furthermore, we propose Prompt-Aware Reconstruction, which incorporates visual-prompt interactions by leveraging cross-attention responses in mask decoder, thus facilitating alignment in both distribution and semantics. To ensure the interaction efficiency, we also introduce a layer-skipping strategy for visual tokens. Extensive experiments are conducted on different segmentation tasks and SAMs of various sizes, and the results show that the proposed SAQ-SAM consistently outperforms baselines. For example, when quantizing SAM-B to 4-bit, our method achieves 11.7% higher mAP than the baseline in instance segmentation task.
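The abstract's core idea in (i) — choosing a clipping range by how well the clipped-and-quantized attention preserves the full-precision attention's focus, rather than by MSE — can be illustrated with a minimal NumPy sketch. This is an assumption-laden toy, not the paper's implementation: `focus_overlap` (Jaccard overlap of top-k attended positions), `fake_quant`, and `search_clip_ratio` are hypothetical names, and the candidate clip ratios are illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def fake_quant(x, n_bits=4):
    """Uniform symmetric fake quantization to n_bits (toy quantizer)."""
    qmax = 2 ** (n_bits - 1) - 1
    scale = np.abs(x).max() / qmax
    return np.round(x / scale).clip(-qmax, qmax) * scale

def focus_overlap(attn_fp, attn_q, k=8):
    """Mean Jaccard overlap of the top-k attended positions per query."""
    top_fp = np.argsort(attn_fp, axis=-1)[..., -k:]
    top_q = np.argsort(attn_q, axis=-1)[..., -k:]
    overlaps = []
    for a, b in zip(top_fp.reshape(-1, k), top_q.reshape(-1, k)):
        sa, sb = set(a.tolist()), set(b.tolist())
        overlaps.append(len(sa & sb) / len(sa | sb))
    return float(np.mean(overlaps))

def search_clip_ratio(scores, n_bits=4,
                      ratios=(1.0, 0.5, 0.1, 0.05, 0.01)):
    """Pick the clip ratio whose clipped+quantized attention best
    preserves the full-precision attention focus."""
    attn_fp = softmax(scores)
    max_abs = np.abs(scores).max()
    best_ratio, best_ov = 1.0, -1.0
    for r in ratios:
        clipped = np.clip(scores, -r * max_abs, r * max_abs)
        ov = focus_overlap(attn_fp, softmax(fake_quant(clipped, n_bits)))
        if ov > best_ov:
            best_ratio, best_ov = r, ov
    return best_ratio, best_ov
```

With an extreme outlier in the scores, an unclipped 4-bit quantizer wastes its range on the outlier and flattens everything else, so a much smaller clip ratio tends to win under this overlap metric — the intuition behind preferring aggressive clipping over MSE-style criteria.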
Problem

Research questions and friction points this paper is trying to address.

Reduces computational costs for edge deployment of SAM.
Improves quantization by addressing extreme outliers in attention.
Enhances semantic alignment during prompt interactions in SAM.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Perceptual-Consistency Clipping for outlier suppression
Prompt-Aware Reconstruction for semantic alignment
Layer-skipping strategy for efficient visual token processing
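The Prompt-Aware Reconstruction idea — aligning quantized and full-precision models not only on visual-token distributions but also on their prompt-conditioned cross-attention responses — can be sketched as a two-term reconstruction objective. This is a hedged toy under stated assumptions: `cross_attn_response`, `prompt_aware_loss`, and the single-head attention are hypothetical simplifications, not SAM's actual mask-decoder code.

```python
import numpy as np

def cross_attn_response(visual, prompt):
    """Single-head cross-attention of prompt tokens over visual tokens."""
    scores = prompt @ visual.T / np.sqrt(visual.shape[-1])
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    attn = e / e.sum(axis=-1, keepdims=True)
    return attn @ visual

def prompt_aware_loss(vis_fp, vis_q, prompt, lam=1.0):
    """Distribution term (MSE on visual tokens) plus semantic term
    (MSE on prompt-conditioned cross-attention responses)."""
    dist = np.mean((vis_fp - vis_q) ** 2)
    sem = np.mean((cross_attn_response(vis_fp, prompt)
                   - cross_attn_response(vis_q, prompt)) ** 2)
    return dist + lam * sem
```

The semantic term penalizes quantization error in proportion to how much it perturbs what the prompt actually attends to, so reconstruction is steered toward prompt-relevant visual tokens rather than treating all tokens equally.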
Jing Zhang
Institute of Automation, Chinese Academy of Sciences; School of Artificial Intelligence, University of Chinese Academy of Sciences
Zhikai Li
Institute of Automation, Chinese Academy of Sciences; School of Artificial Intelligence, University of Chinese Academy of Sciences
Qingyi Gu
Institute of Automation, Chinese Academy of Sciences
High-speed vision; cell analysis