Don't Guess, Just Ask: Resolving Ambiguity in Referring Segmentation via Multi-turn Clarification

📅 2026-05-17

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses ambiguous or underspecified referring expressions, which are common in real-world use and which existing referring segmentation methods handle poorly. It proposes IC-Seg, an agentic framework that clarifies user intent through multi-turn dialogue before performing segmentation. To train this behavior, the authors introduce Hi-GRPO, a hierarchical optimization strategy that provides dense supervision at the trajectory, turn, and step levels, encouraging efficient clarification and reducing redundant interactions. Evaluated on Ambi-RVOS, a newly built referring video object segmentation benchmark with ambiguous queries, IC-Seg outperforms existing methods by a large margin while maintaining state-of-the-art performance on standard reasoning segmentation benchmarks.

📝 Abstract

Referring segmentation aims to segment the target objects in images or videos based on the textual query. Despite remarkable progress over the past years, existing works always assume that the user-provided queries are already precise and clear. However, this assumption is impractical. In real-world scenarios, it is unrealistic to expect all users to thoroughly review their visual content and carefully ensure their queries are unique and unambiguous. When encountering such cases, existing segmentation models tend to arbitrarily guess the user preferences, often resulting in undesired outcomes. To address this limitation, we propose IC-Seg, a novel agentic framework that proactively clarifies user intent through multi-turn conversation before segmentation. To effectively incentivize this capability, we further introduce Hi-GRPO, a new hierarchical optimization strategy that injects dense and informative supervision signals at the trajectory, turn, and step levels. This strategy encourages efficient intent clarification, effectively eliminating redundant interactions and improving overall dialogue quality. For evaluation, we establish Ambi-RVOS, a referring video object segmentation benchmark with ambiguous user queries. Extensive experiments demonstrate that IC-Seg not only outperforms existing methods by a large margin in resolving ambiguous queries, but also maintains state-of-the-art performance on standard reasoning segmentation benchmarks. Code and data will be released at https://github.com/iSEE-Laboratory/IC-Seg.
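
The clarify-before-segment idea in the abstract can be illustrated with a toy loop. This is only a sketch under strong assumptions and is not IC-Seg's implementation: the abstract does not specify the agent, the segmentation model, or the dialogue policy. Here each candidate object is reduced to a hypothetical attribute dictionary, and the names `Candidate`, `pick_question`, `clarify_then_segment`, and `answer_fn` are invented for illustration. The loop asks about an attribute that still distinguishes the remaining candidates, and stops as soon as one candidate is left, so no redundant question is asked.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Set, Tuple


@dataclass
class Candidate:
    """Hypothetical stand-in for a detected object: an id plus attributes."""
    obj_id: int
    attributes: Dict[str, str]


def pick_question(candidates: List[Candidate], asked: Set[str]) -> Optional[str]:
    """Return an unasked attribute whose value differs across candidates."""
    keys = sorted({k for c in candidates for k in c.attributes})
    for key in keys:
        if key in asked:
            continue
        values = {c.attributes.get(key) for c in candidates}
        if len(values) > 1:  # this attribute can still tell candidates apart
            return key
    return None


def clarify_then_segment(
    candidates: List[Candidate],
    answer_fn: Callable[[str], str],
    max_turns: int = 3,
) -> Tuple[Optional[int], List[Tuple[str, str]]]:
    """Ask clarifying questions until one candidate remains or turns run out.

    Returns the chosen object id (None if still ambiguous) and the dialogue.
    """
    asked: Set[str] = set()
    dialogue: List[Tuple[str, str]] = []
    remaining = list(candidates)
    for _ in range(max_turns):
        if len(remaining) <= 1:
            break  # already unambiguous: asking more would be redundant
        key = pick_question(remaining, asked)
        if key is None:
            break  # nothing left that could disambiguate
        asked.add(key)
        answer = answer_fn(key)  # simulated user reply
        dialogue.append((key, answer))
        filtered = [c for c in remaining if c.attributes.get(key) == answer]
        if filtered:  # ignore answers that match no candidate
            remaining = filtered
    target = remaining[0].obj_id if len(remaining) == 1 else None
    return target, dialogue


# Toy scene: the query "the dog" is ambiguous among three dogs.
dogs = [
    Candidate(0, {"color": "brown", "position": "left"}),
    Candidate(1, {"color": "brown", "position": "right"}),
    Candidate(2, {"color": "white", "position": "right"}),
]
intent = {"color": "brown", "position": "right"}
target, dialogue = clarify_then_segment(dogs, lambda key: intent[key])
print(target, dialogue)
```

In a real system the returned id would be handed to a segmentation model, and the question policy would be learned rather than hand-written; the early exit on a single remaining candidate only mirrors the abstract's stated goal of eliminating redundant interactions.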

Problem

Research questions and friction points this paper is trying to address.

referring segmentation

ambiguity

user query

clarification

video object segmentation

Innovation

Methods, ideas, or system contributions that make the work stand out.

referring segmentation

multi-turn clarification

ambiguity resolution