X-SAM: From Segment Anything to Any Segmentation

📅 2025-08-06
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current large language models (LLMs) lack pixel-level perception, while the Segment Anything Model (SAM) supports visual-prompt-driven segmentation but struggles with multi-mask prediction, category-specific segmentation, and unified task modeling. To address these limitations, we propose X-SAM, a unified multimodal large language model (MLLM) framework that extends segmentation from *segment anything* to *any segmentation*. The approach introduces: (1) a new Visual GrounDed (VGD) segmentation task that segments all instance objects from interactive visual prompts; (2) a unified framework that equips MLLMs with advanced pixel-level perceptual comprehension; and (3) a unified training strategy that supports co-training across diverse data sources. Experiments demonstrate that X-SAM achieves state-of-the-art performance on a wide range of image segmentation benchmarks, significantly enhancing pixel-level semantic comprehension and cross-task generalization.

📝 Abstract
Large Language Models (LLMs) demonstrate strong capabilities in broad knowledge representation, yet they are inherently deficient in pixel-level perceptual understanding. Although the Segment Anything Model (SAM) represents a significant advancement in visual-prompt-driven image segmentation, it exhibits notable limitations in multi-mask prediction and category-specific segmentation tasks, and it cannot integrate all segmentation tasks within a unified model architecture. To address these limitations, we present X-SAM, a streamlined Multimodal Large Language Model (MLLM) framework that extends the segmentation paradigm from *segment anything* to *any segmentation*. Specifically, we introduce a novel unified framework that enables more advanced pixel-level perceptual comprehension for MLLMs. Furthermore, we propose a new segmentation task, termed Visual GrounDed (VGD) segmentation, which segments all instance objects with interactive visual prompts and empowers MLLMs with visually grounded, pixel-wise interpretive capabilities. To enable effective training on diverse data sources, we present a unified training strategy that supports co-training across multiple datasets. Experimental results demonstrate that X-SAM achieves state-of-the-art performance on a wide range of image segmentation benchmarks, highlighting its efficiency for multimodal, pixel-level visual understanding. Code is available at https://github.com/wanghao9610/X-SAM.
Problem

Research questions and friction points this paper is trying to address.

LLMs lack pixel-level perceptual understanding
SAM struggles with multi-mask and category-specific segmentation
No unified model integrates all segmentation tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unified MLLM framework for advanced segmentation
Visual GrounDed segmentation with interactive prompts
Unified training strategy for diverse datasets
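The abstract does not give implementation details, but the idea of one model accepting either a language instruction or an interactive visual prompt (the VGD setting) and emitting masks can be illustrated with a toy sketch. Everything below is hypothetical: the function names, the scalar "embeddings", and the thresholding logic are illustrative stand-ins, not the authors' architecture or API.

```python
# Toy sketch of a dual-prompt segmentation interface, loosely inspired by
# X-SAM's setting. All names and logic here are illustrative assumptions.

def text_prompt_embed(prompt: str) -> float:
    """Stand-in for an LLM's language embedding: a single scalar in [0, 1)."""
    return (sum(ord(c) for c in prompt) % 7) / 7.0

def visual_prompt_embed(point):
    """Stand-in for a vision prompt encoder: passes the click through."""
    return point

def segment(image, text_prompt=None, visual_prompt=None):
    """Return a binary mask over a 2D list of pixel intensities.

    A text prompt conditions the threshold; a visual prompt (a clicked
    (row, col) point) restricts the mask to the clicked row, a crude
    stand-in for instance selection.
    """
    thr = 0.5 if text_prompt is None else text_prompt_embed(text_prompt)
    mask = [[1 if px > thr else 0 for px in row] for row in image]
    if visual_prompt is not None:
        r, _ = visual_prompt_embed(visual_prompt)
        mask = [row if i == r else [0] * len(row)
                for i, row in enumerate(mask)]
    return mask

image = [[0.9, 0.1],
         [0.8, 0.7]]
print(segment(image))                        # text-free, default threshold
print(segment(image, visual_prompt=(0, 0)))  # interactive click on row 0
```

The point of the sketch is the shared interface: both prompt types flow into one `segment` call, mirroring the paper's goal of unifying prompt-driven and category-driven segmentation in a single model rather than separate task-specific heads.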