🤖 AI Summary
To address the challenge of applying post-training quantization (PTQ) to the Segment Anything Model (SAM) for edge deployment, this work proposes SAQ-SAM, a semantics-aligned PTQ framework that overcomes two critical bottlenecks: extreme attention outliers in the mask decoder and distorted prompt-visual interaction during reconstruction. The method introduces Perceptual-Consistency Clipping, which uses attention focus overlap as the clipping metric to aggressively suppress outliers, and Prompt-Aware Reconstruction, which explicitly models cross-attention responses in the mask decoder to align visual tokens with prompt intent. A layer-skipping strategy for visual tokens additionally keeps the prompt interaction efficient. Together, these components align the quantized model with its full-precision counterpart in both distribution and semantics. When quantizing SAM-B to 4-bit, the method achieves 11.7% higher instance segmentation mAP than the baseline, and evaluations across SAM variants of various sizes and tasks show consistent gains over state-of-the-art PTQ methods.
📝 Abstract
Segment Anything Model (SAM) exhibits remarkable zero-shot segmentation capability; however, its prohibitive computational costs make edge deployment challenging. Although post-training quantization (PTQ) offers a promising compression solution, existing methods yield unsatisfactory results when applied to SAM, owing to its specialized model components and promptable workflow: (i) The mask decoder's attention exhibits extreme outliers, and we find that aggressive clipping (reducing the range by up to 100$\times$), rather than smoothing or isolation, is effective in suppressing outliers while maintaining semantic capability. Unfortunately, traditional metrics (e.g., MSE) fail to identify such large-scale clipping. (ii) Existing reconstruction methods potentially neglect the prompts' intention, resulting in distorted visual encodings during prompt interactions. To address these issues, we propose SAQ-SAM in this paper, which boosts PTQ of SAM through semantic alignment. Specifically, we propose Perceptual-Consistency Clipping, which exploits attention focus overlap as the clipping metric to significantly suppress outliers. Furthermore, we propose Prompt-Aware Reconstruction, which incorporates visual-prompt interactions by leveraging cross-attention responses in the mask decoder, thus facilitating alignment in both distribution and semantics. To ensure interaction efficiency, we also introduce a layer-skipping strategy for visual tokens. Extensive experiments are conducted on different segmentation tasks and SAMs of various sizes, and the results show that the proposed SAQ-SAM consistently outperforms baselines. For example, when quantizing SAM-B to 4-bit, our method achieves 11.7% higher mAP than the baseline in the instance segmentation task.
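To make the clipping idea concrete, here is a minimal sketch of how an attention-focus-overlap metric could drive the clip-range search, in contrast to an MSE criterion. The abstract only states that focus overlap is used as the metric; the top-k definition of "focus", the Jaccard overlap, the candidate grid, and the `min_overlap` threshold below are illustrative assumptions, not the paper's exact procedure.

```python
import numpy as np

def attention_focus_overlap(logits, clipped_logits, top_k=8):
    """Average Jaccard overlap between the top-k attended key positions
    per query, before vs. after clipping the attention logits.
    (Softmax is monotone, so top-k of logits equals top-k of attention.)"""
    ref = np.argsort(logits, axis=-1)[..., -top_k:]
    cmp = np.argsort(clipped_logits, axis=-1)[..., -top_k:]
    overlaps = []
    for r, c in zip(ref.reshape(-1, top_k), cmp.reshape(-1, top_k)):
        inter = len(set(r.tolist()) & set(c.tolist()))
        overlaps.append(inter / (2 * top_k - inter))  # Jaccard index
    return float(np.mean(overlaps))

def search_clip_value(logits, candidates, top_k=8, min_overlap=0.9):
    """Return the most aggressive (smallest) symmetric clip range whose
    post-clipping attention focus still overlaps the full-precision
    focus by at least `min_overlap`; fall back to the widest range."""
    for c in sorted(candidates):  # try the tightest range first
        clipped = np.clip(logits, -c, c)
        if attention_focus_overlap(logits, clipped, top_k) >= min_overlap:
            return c
    return max(candidates)
```

The key design point this sketch captures: MSE between `logits` and `clipped_logits` grows with the clipping strength and therefore always favors wide ranges, whereas a focus-overlap criterion can accept a very tight range as long as each query still attends to (mostly) the same positions, which is what preserves segmentation semantics.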
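Similarly, the Prompt-Aware Reconstruction objective can be sketched as a standard block-reconstruction loss augmented with a semantic term on prompt-to-visual cross-attention. The paper specifies only that cross-attention responses in the mask decoder are leveraged; the single-head attention form, the MSE on attention maps, and the `alpha` weighting below are assumptions for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def prompt_aware_loss(fp_tokens, q_tokens, prompt_queries, alpha=1.0):
    """Reconstruction loss aligning a quantized block's visual tokens
    with the full-precision ones in distribution (feature MSE) and in
    semantics (MSE between prompt->visual cross-attention maps).
    fp_tokens, q_tokens: (num_tokens, dim); prompt_queries: (num_prompts, dim)."""
    # distributional term: plain feature reconstruction
    dist = np.mean((q_tokens - fp_tokens) ** 2)
    # semantic term: how prompt queries attend over the visual tokens
    d = fp_tokens.shape[-1]
    fp_attn = softmax(prompt_queries @ fp_tokens.T / np.sqrt(d))
    q_attn = softmax(prompt_queries @ q_tokens.T / np.sqrt(d))
    sem = np.mean((q_attn - fp_attn) ** 2)
    return dist + alpha * sem
```

Under this view, a purely distributional loss can be small even when quantization redirects the prompt's attention to the wrong tokens; the cross-attention term penalizes exactly that failure mode.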