R2SM: Referring and Reasoning for Selective Masks

📅 2025-06-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
R2SM addresses user-intent-driven mask-type selection in text-guided segmentation: given a natural-language prompt, the model must determine whether to generate a visible (modal) or complete (amodal) segmentation mask. This work is the first to formulate mask-type decision-making as a language-intention understanding task. It introduces the R2SM benchmark for modal/amodal selection, constructed by unifying the COCOA-cls, D2SA, and MUVA datasets and augmenting them with cross-dataset mask synthesis and fine-grained intention annotations, and proposes an end-to-end vision-language framework for joint reasoning and mask generation. Experiments show gains in both intent-recognition accuracy (modal/amodal classification) and segmentation quality, establishing a new setting for intention-aware multimodal segmentation.

📝 Abstract
We introduce a new task, Referring and Reasoning for Selective Masks (R2SM), which extends text-guided segmentation by incorporating mask-type selection driven by user intent. This task challenges vision-language models to determine whether to generate a modal (visible) or amodal (complete) segmentation mask based solely on natural language prompts. To support the R2SM task, we present the R2SM dataset, constructed by augmenting annotations of COCOA-cls, D2SA, and MUVA. The R2SM dataset consists of both modal and amodal text queries, each paired with the corresponding ground-truth mask, enabling model fine-tuning and evaluation of the ability to segment images according to user intent. Specifically, the task requires the model to interpret whether a given prompt refers to only the visible part of an object or to its complete shape, including occluded regions, and then produce the appropriate segmentation. For example, if a prompt explicitly requests the whole shape of a partially hidden object, the model is expected to output an amodal mask that completes the occluded parts. In contrast, prompts that make no mention of hidden regions should yield standard modal masks. The R2SM benchmark provides a challenging and insightful testbed for advancing research in multimodal reasoning and intent-aware segmentation.
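To make the task format concrete, here is a minimal sketch of what an R2SM-style sample and model interface could look like. All names here (`R2SMSample`, `MaskType`, the field names) are illustrative assumptions, not the dataset's published schema:

```python
from dataclasses import dataclass
from enum import Enum

import numpy as np

class MaskType(Enum):
    MODAL = "modal"    # visible region only
    AMODAL = "amodal"  # complete shape, including occluded parts

@dataclass
class R2SMSample:
    """Hypothetical R2SM sample; field names are assumptions for illustration."""
    image: np.ndarray     # H x W x 3 RGB image
    query: str            # natural-language referring expression
    mask_type: MaskType   # ground-truth intent label implied by the query
    mask: np.ndarray      # H x W binary ground-truth mask of the chosen type

# A model for this task maps (image, query) -> (predicted mask type, mask),
# so both intent classification and mask quality can be evaluated.
```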
Problem

Research questions and friction points this paper is trying to address.

Determining modal or amodal masks from text prompts
Segmenting objects based on user intent in queries
Interpreting prompts for visible vs. complete object shapes
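As a point of reference for the last item above, the intent-recognition half of the task can be approximated by a naive keyword heuristic. The sketch below is a minimal baseline assuming a hand-picked cue list; both the cue list and the function name are invented here, and R2SM itself relies on vision-language reasoning rather than keywords:

```python
# Cues that explicitly request the complete/hidden shape (assumed list).
AMODAL_CUES = ("whole", "complete", "entire", "full", "occluded", "hidden", "covered")

def naive_intent(prompt: str) -> str:
    """Return "amodal" only when the prompt explicitly asks for the
    complete or hidden shape; otherwise default to "modal"."""
    text = prompt.lower()
    return "amodal" if any(cue in text for cue in AMODAL_CUES) else "modal"

assert naive_intent("segment the whole cup behind the laptop") == "amodal"
assert naive_intent("segment the cup behind the laptop") == "modal"
```

Such a heuristic breaks down when intent is implied rather than stated, which is exactly the friction point the benchmark is designed to probe.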
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces R2SM task for intent-driven segmentation
Uses modal and amodal text queries
Augments COCOA-cls, D2SA, MUVA datasets