X2SAM: Any Segmentation in Images and Videos

📅 2026-04-27
📈 Citations: 0
✨ Influential: 0

🤖 AI Summary
Existing segmentation approaches are usually specialized for either images or videos, and they rarely support interactive tasks driven jointly by textual and visual prompts. To address these limitations, this work proposes X2SAM, which it presents as the first multimodal large language model for any-segmentation across both images and videos. X2SAM couples an LLM with a Mask Memory module that supports temporally consistent video mask generation, and it is trained jointly on heterogeneous image and video datasets. The model covers open-vocabulary, referring, reasoning, and interactive segmentation tasks. It also reports state-of-the-art performance on V-VGD, a newly introduced benchmark for segmenting object tracks in videos from interactive visual prompts, while preserving general image and video chat ability.
📝 Abstract
Multimodal Large Language Models (MLLMs) have demonstrated strong image-level visual understanding and reasoning, yet their pixel-level perception across both images and videos remains limited. Foundation segmentation models such as the SAM series produce high-quality masks, but they rely on low-level visual prompts and cannot natively interpret complex conversational instructions. Existing segmentation MLLMs narrow this gap, but are usually specialized for either images or videos and rarely support both textual and visual prompts in one interface. We introduce X2SAM, a unified segmentation MLLM that extends any-segmentation capabilities from images to videos. Given conversational instructions and visual prompts, X2SAM couples an LLM with a Mask Memory module that stores guided vision features for temporally consistent video mask generation. The same formulation supports generic, open-vocabulary, referring, reasoning, grounded conversation generation, interactive, and visual grounded segmentation across image and video inputs. We further introduce the Video Visual Grounded (V-VGD) segmentation benchmark, which evaluates whether a model can segment object tracks in videos from interactive visual prompts. With a unified joint training strategy over heterogeneous image and video datasets, X2SAM delivers strong video segmentation performance, remains competitive on image segmentation benchmarks, and preserves general image and video chat ability.
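The abstract says that the Mask Memory module stores guided vision features so that video masks stay temporally consistent, but it gives no implementation details. The sketch below is only one way such a memory could work, not the authors' design: a fixed-size FIFO buffer of per-frame feature vectors that is read back with softmax attention. The names `MaskMemory`, `segment_video`, `capacity`, and `threshold`, and the toy thresholding used as a stand-in for mask decoding, are all assumptions made for illustration.

```python
import math
from collections import deque


class MaskMemory:
    """Hypothetical FIFO memory of per-frame feature vectors (not the paper's module)."""

    def __init__(self, capacity):
        # A deque with maxlen silently drops the oldest entry once it is full.
        self.entries = deque(maxlen=capacity)

    def write(self, feature):
        """Store a copy of one frame's feature vector."""
        self.entries.append(list(feature))

    def read(self, query):
        """Return a softmax-attention blend of stored features, or the query if empty."""
        if not self.entries:
            return list(query)
        scale = math.sqrt(len(query))
        scores = [
            sum(q * k for q, k in zip(query, key)) / scale for key in self.entries
        ]
        peak = max(scores)  # subtract the max score for numerical stability
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        return [
            sum(w * key[i] for w, key in zip(weights, self.entries)) / total
            for i in range(len(query))
        ]


def segment_video(frame_features, memory, threshold=0.0):
    """Toy loop: condition each frame on the memory, then store that frame's feature."""
    masks = []
    for feature in frame_features:
        context = memory.read(feature)
        # Assumed stand-in for a mask decoder: average the frame feature with
        # the memory context and threshold each element.
        fused = [(f + c) / 2 for f, c in zip(feature, context)]
        masks.append([1 if value > threshold else 0 for value in fused])
        memory.write(feature)
    return masks
```

In this sketch each frame is decoded using features from earlier frames, which is the general idea behind memory-based temporal consistency. A real system would store dense feature maps and use a learned decoder rather than a threshold.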
Problem

Research questions and friction points this paper is trying to address.

segmentation
multimodal large language models
video
image
visual grounding
Innovation

Methods, ideas, or system contributions that make the work stand out.

X2SAM
multimodal large language model
video segmentation
Mask Memory
visual grounded segmentation