Omni-Referring Image Segmentation

📅 2025-12-07
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This paper introduces the OmniRIS task, addressing highly generalizable multimodal image segmentation by jointly leveraging textual instructions and reference images annotated with masks, bounding boxes, or scribbles as unified prompts—thereby synergizing fine-grained linguistic semantics with precise visual spatial localization. To this end, we propose the first multimodal unified prompting paradigm and design OmniSegNet, an end-to-end trainable architecture incorporating cross-modal alignment and multi-prompt fusion mechanisms. Furthermore, we release OmniRef, a large-scale, multi-scenario benchmark dataset comprising 180K samples. Extensive experiments demonstrate that our method significantly outperforms existing approaches across diverse segmentation settings—including one-to-one and many-to-many configurations—and exhibits strong generalization capability, particularly in localizing rare objects and performing attribute-driven segmentation.
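The summary above describes a unified prompting paradigm that accepts text instructions alongside visual references (masks, boxes, or scribbles). As a rough illustration only — the paper's actual encoder is not detailed here, and every class and function name below is hypothetical — an omni-prompt could be modeled as an optional text field plus an optional visual reference, flattened into one token stream for a downstream model:

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

@dataclass
class VisualPrompt:
    kind: str            # one of "mask", "box", "scribble"
    ref_image_id: str    # identifier of the annotated reference image
    geometry: list       # e.g. box coordinates or scribble points

@dataclass
class OmniPrompt:
    text: Optional[str] = None
    visual: Optional[VisualPrompt] = None

def encode_omni_prompt(p: OmniPrompt) -> List[Tuple[str, str]]:
    """Flatten whichever modalities are present into a single token stream.

    A real system would map these to embeddings; here we just tag each
    entry with its modality so both prompt types share one interface.
    """
    tokens: List[Tuple[str, str]] = []
    if p.text:
        tokens.append(("text", p.text))
    if p.visual:
        tokens.append(("visual", f"{p.visual.kind}:{p.visual.ref_image_id}"))
    if not tokens:
        raise ValueError("an omni-prompt needs at least one modality")
    return tokens
```

This sketch only conveys why a single interface over both modalities simplifies downstream fusion; it makes no claim about OmniSegNet's internals.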

📝 Abstract
In this paper, we propose a novel task termed Omni-Referring Image Segmentation (OmniRIS) towards highly generalized image segmentation. Compared with existing unimodally conditioned segmentation tasks, such as RIS and visual RIS, OmniRIS supports text instructions and reference images with masks, boxes, or scribbles as omni-prompts. This property allows it to exploit the intrinsic merits of both the text and visual modalities, i.e., granular attribute referring and uncommon object grounding, respectively. Besides, OmniRIS can also handle various segmentation settings, such as one-vs.-many and many-vs.-many, further facilitating its practical use. To promote research on OmniRIS, we rigorously design and construct a large dataset termed OmniRef, which consists of 186,939 omni-prompts for 30,956 images, and establish a comprehensive evaluation system. Moreover, a strong and general baseline termed OmniSegNet is proposed to tackle the key challenges of OmniRIS, such as omni-prompt encoding. Extensive experiments not only validate the capability of OmniSegNet in following omni-modal instructions, but also show the superiority of OmniRIS for highly generalized image segmentation.
Problem

Research questions and friction points this paper is trying to address.

Proposes OmniRIS for highly generalized image segmentation
Handles text and visual prompts like masks, boxes, scribbles
Introduces OmniRef dataset and OmniSegNet baseline for evaluation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces OmniRIS supporting text and visual omni-prompts
Proposes OmniSegNet baseline for omni-prompt encoding challenges
Constructs OmniRef dataset with comprehensive evaluation system
Qiancheng Zheng
Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University, 361005, P.R. China.
Yunhang Shen
Youtu Lab, Tencent, P.R. China.
Gen Luo
Shanghai AI Laboratory
computer vision, vision and language
Baiyang Song
Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University, 361005, P.R. China.
Xing Sun
Tencent Youtu Lab
LLM, MLLM, Agent
Xiaoshuai Sun
Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University, 361005, P.R. China.
Yiyi Zhou
Xiamen University
deep learning, language and vision
Rongrong Ji
Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University, 361005, P.R. China.