Re-purposing SAM into Efficient Visual Projectors for MLLM-Based Referring Image Segmentation

πŸ“… 2025-09-17
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
To address the excessive computational overhead in multimodal large language models (MLLMs) for referring image segmentation (RIS), caused by redundant visual tokens, this paper reformulates the Segment Anything Model (SAM) as an efficient visual projector. The core method employs SAM-generated semantic superpixels as β€œvisual words” to enable adaptive token compression; it further introduces position-aware embeddings and a multi-scale aggregation mechanism to preserve fine-grained semantics and global structure despite drastic sequence-length reduction. Experiments demonstrate that the approach reduces visual tokens by 93%, significantly accelerating both training and inference, while achieving performance on par with full-token baselines and substantially outperforming existing token-compression methods in RIS.

πŸ“ Abstract
Recently, Referring Image Segmentation (RIS) frameworks that pair the Multimodal Large Language Model (MLLM) with the Segment Anything Model (SAM) have achieved impressive results. However, adapting MLLM to segmentation is computationally intensive, primarily due to visual token redundancy. We observe that traditional patch-wise visual projectors struggle to strike a balance between reducing the number of visual tokens and preserving semantic clarity, often retaining overly long token sequences to avoid performance drops. Inspired by text tokenizers, we propose a novel semantic visual projector that leverages semantic superpixels generated by SAM to identify "visual words" in an image. By compressing and projecting semantic superpixels as visual tokens, our approach adaptively shortens the token sequence according to scene complexity while minimizing semantic loss in compression. To mitigate loss of information, we propose a semantic superpixel positional embedding to strengthen MLLM's awareness of superpixel geometry and position, alongside a semantic superpixel aggregator to preserve both fine-grained details inside superpixels and global context outside. Experiments show that our method cuts visual tokens by 93% without compromising performance, notably speeding up MLLM training and inference, and outperforming existing compressive visual projectors on RIS.
Problem

Research questions and friction points this paper is trying to address.

Reducing visual token redundancy in MLLM-based segmentation
Balancing token compression with semantic preservation
Improving computational efficiency of MLLM-SAM integration
Innovation

Methods, ideas, or system contributions that make the work stand out.

Leverages SAM superpixels as visual words
Compresses tokens adaptively by scene complexity
Uses positional embeddings and aggregator for details
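The core compression idea above can be illustrated with a minimal sketch: pool the patch-grid features inside each SAM-generated superpixel into a single token, so the token count tracks scene complexity rather than image resolution. This is a hypothetical illustration, not the paper's code; the function name `superpixel_pool`, the shapes, and the use of plain average pooling (in place of the paper's semantic superpixel aggregator) are assumptions.

```python
import numpy as np

def superpixel_pool(patch_feats, superpixel_ids):
    """Compress a patch-feature grid into one token per superpixel.

    patch_feats: (H, W, C) visual features from the encoder.
    superpixel_ids: (H, W) integer map assigning each patch to a superpixel
        (as a SAM-style segmentation might provide).
    Returns: (num_superpixels, C) compressed visual tokens.
    """
    H, W, C = patch_feats.shape
    flat_feats = patch_feats.reshape(-1, C)
    flat_ids = superpixel_ids.reshape(-1)
    # One token per region: masked average pooling over each superpixel.
    tokens = np.stack(
        [flat_feats[flat_ids == i].mean(axis=0) for i in np.unique(flat_ids)]
    )
    return tokens

# Toy example: a 4x4 patch grid with 3 superpixels -> 3 tokens instead of 16.
rng = np.random.default_rng(0)
feats = rng.standard_normal((4, 4, 8))
mask = np.array([[0, 0, 1, 1],
                 [0, 0, 1, 1],
                 [2, 2, 1, 1],
                 [2, 2, 2, 2]])
tokens = superpixel_pool(feats, mask)
print(tokens.shape)  # (3, 8)
```

A simpler scene (fewer superpixels) yields a shorter token sequence, which is the adaptive-length property the paper exploits; the positional embedding and aggregator components would then restore the geometry and fine-grained detail this pooling discards.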
Xiaobo Yang
Zhejiang University, Hangzhou, Zhejiang, China
Xiaojin Gong
Zhejiang University
Computer Vision · Image Processing · Artificial Intelligence