🤖 AI Summary
Existing image segmentation models produce high-quality masks for visual entities but lack comprehensive semantic understanding of joint language-vision prompts, hindering user-friendly cross-modal interaction. To address this, the authors introduce omnimodal referring expression segmentation (ORES), a novel task in which a model produces a group of masks from arbitrary prompts specified by text alone or text plus reference visual entities. They formally define the ORES task; construct MaskGroups-2M and MaskGroups-HQ, datasets of diverse mask groups for training and benchmarking; and propose "Refer to Any Segmentation Mask Group" (RAS), a framework that augments segmentation models with complex multimodal interaction and comprehension via a mask-centric large multimodal model. Experiments demonstrate that RAS achieves superior performance on ORES as well as on the classic referring expression segmentation (RES) and generalized referring expression segmentation (GRES) tasks, advancing complex cross-modal referential understanding and fine-grained segmentation.
📝 Abstract
Recent image segmentation models have advanced to segment images into high-quality masks for visual entities, and yet they cannot provide comprehensive semantic understanding for complex queries based on both language and vision. This limitation reduces their effectiveness in applications that require user-friendly interactions driven by vision-language prompts. To bridge this gap, we introduce a novel task of omnimodal referring expression segmentation (ORES). In this task, a model produces a group of masks based on arbitrary prompts specified by text only or text plus reference visual entities. To address this new challenge, we propose a novel framework to "Refer to Any Segmentation Mask Group" (RAS), which augments segmentation models with complex multimodal interactions and comprehension via a mask-centric large multimodal model. For training and benchmarking ORES models, we create datasets MaskGroups-2M and MaskGroups-HQ to include diverse mask groups specified by text and reference entities. Through extensive evaluation, we demonstrate superior performance of RAS on our new ORES task, as well as classic referring expression segmentation (RES) and generalized referring expression segmentation (GRES) tasks. Project page: https://Ref2Any.github.io.