Tarot-SAM3: Training-free SAM3 for Any Referring Expression Segmentation

📅 2026-04-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing referring expression segmentation (RES) methods rely heavily on annotated data and handle either explicit or implicit expressions, but not both. Naively coupling a multimodal large language model (MLLM) with SAM3 leaves the result dependent on the MLLM's reasoning and offers no way to refine SAM3's masks. The authors propose Tarot-SAM3, a training-free framework with two phases. The Expression Reasoning Interpreter (ERI) parses and rephrases an arbitrary expression into heterogeneous prompts for SAM3. The Mask Self-Refining (MSR) phase then selects the best mask across prompt types and refines it using DINOv3 feature relationships, correcting over- and under-segmentation. The abstract reports strong performance on explicit and implicit RES benchmarks and in open-world scenarios, with ablation studies supporting the contribution of each phase.
📝 Abstract
Referring Expression Segmentation (RES) aims to segment image regions described by natural-language expressions, serving as a bridge between vision and language understanding. Existing RES methods, however, rely heavily on large annotated datasets and are limited to either explicit or implicit expressions, hindering their ability to generalize to any referring expression. Recently, the Segment Anything Model 3 (SAM3) has shown impressive robustness in Promptable Concept Segmentation. Nonetheless, applying it to RES remains challenging: (1) SAM3 struggles with longer or implicit expressions; (2) naive coupling of SAM3 with a multimodal large language model (MLLM) makes the final results overly dependent on the MLLM's reasoning capability, without enabling refinement of SAM3's segmentation outputs. To this end, we present Tarot-SAM3, a novel training-free framework that can accurately segment from any referring expression. Specifically, Tarot-SAM3 consists of two key phases. First, the Expression Reasoning Interpreter (ERI) phase introduces reasoning-assisted prompt options to support structured expression parsing and evaluation-aware rephrasing. This transforms arbitrary queries into robust heterogeneous prompts for generating reliable masks with SAM3. Second, the Mask Self-Refining (MSR) phase selects the best mask across prompt types and performs self-refinement by leveraging rich feature relationships from DINOv3 to compare discriminative regions among ERI outputs. It then infers region affiliation to the target, thereby correcting over- and under-segmentation. Extensive experiments demonstrate that Tarot-SAM3 achieves strong performance on both explicit and implicit RES benchmarks, as well as open-world scenarios. Ablation studies further validate the effectiveness of each phase.
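The select-then-refine idea behind the MSR phase can be illustrated with a toy sketch. This is not the authors' algorithm: the 2-D vectors stand in for DINOv3 patch features, and the confidence scores, the `threshold` value, the "agreed core" prototype, and all function names are assumptions made for illustration only.

```python
import math

def cosine(a, b):
    """Cosine similarity between two feature vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def prototype(features, patch_ids):
    """Mean feature vector over the given patch indices."""
    dim = len(features[0])
    n = len(patch_ids)
    return [sum(features[i][d] for i in patch_ids) / n for d in range(dim)]

def select_mask(candidates):
    """Pick the candidate mask with the highest (hypothetical) score."""
    return max(candidates, key=lambda c: c[1])[0]

def refine_mask(features, candidates, threshold=0.8):
    """Refine the best mask by re-deciding patches the candidates disagree on.

    Assumed rule: patches shared by all candidates form a trusted core;
    a disputed patch is kept only if it resembles the core prototype.
    """
    best = select_mask(candidates)
    masks = [m for m, _ in candidates]
    core = set.intersection(*masks) or set(best)
    proto = prototype(features, core)
    disputed = set.union(*masks) - set.intersection(*masks)
    refined = set(best)
    for i in disputed:
        if cosine(features[i], proto) >= threshold:
            refined.add(i)      # fixes under-segmentation
        else:
            refined.discard(i)  # fixes over-segmentation
    return refined

# Toy data: patches 0-3 belong to the target, patches 4-5 are background.
features = [
    [1.0, 0.0], [0.9, 0.1], [1.0, 0.1], [0.95, 0.05],
    [0.0, 1.0], [0.1, 0.9],
]
# Each candidate stands for a mask produced from a different prompt type.
candidates = [
    ({0, 1, 2, 4}, 0.9),  # includes background patch 4, misses patch 3
    ({0, 1, 2, 3}, 0.7),
]
refined = refine_mask(features, candidates)
```

In this toy case the highest-scoring candidate both over-segments (patch 4) and under-segments (patch 3). Comparing the disputed patches against the core prototype drops patch 4 and recovers patch 3.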
Problem

Research questions and friction points this paper is trying to address.

Referring Expression Segmentation
Segment Anything Model
implicit expressions
vision-language understanding
promptable segmentation
Innovation

Methods, ideas, or system contributions that make the work stand out.

training-free
referring expression segmentation
prompt engineering
mask self-refinement
multimodal reasoning
Authors
Weiming Zhang
The Hong Kong University of Science and Technology (Guangzhou)
Dingwen Xiao
The Hong Kong University of Science and Technology (Guangzhou)
Songyue Guo
The Hong Kong University of Science and Technology (Guangzhou)
Guangyu Xiang
The Hong Kong University of Science and Technology (Guangzhou)
Shiqi Wen
The Hong Kong University of Science and Technology (Guangzhou)
Minwei Zhao
UCL Department of Geography
Spatial inequality, health geography, urban analysis
Lei Chen
Hong Kong University of Science and Technology
Human Powered Machine Learning, Databases, Data Mining
Lin Wang
Nanyang Technological University