ROSE: Retrieval-Oriented Segmentation Enhancement

Date: 2026-04-15
Citations: 0
Influential: 0

AI Summary
This work addresses the difficulty multimodal large language models have in segmenting novel entities unseen during training and emerging entities that require up-to-date external knowledge. The paper introduces the Novel Emerging Segmentation Task (NEST), builds a benchmark from automatically collected news data, and proposes ROSE, a plug-and-play, retrieval-oriented segmentation enhancement framework. ROSE combines internet-based retrieval-augmented generation with joint textual and visual prompting, and includes a WebSense module that decides when retrieval should be triggered. On the NEST benchmark, ROSE outperforms a Gemini-2.0 Flash-based retrieval baseline by 19.2 in gIoU.
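The summary reports results in gIoU without defining it. A minimal sketch follows, assuming the convention used in LISA-style reasoning segmentation work, where gIoU is the mean of per-image IoU scores; this page does not confirm that NEST uses exactly this definition. The function names and the handling of two empty masks are illustrative assumptions.

```python
from typing import Sequence

# A binary mask as nested sequences: 1 = foreground, 0 = background.
Mask = Sequence[Sequence[int]]


def iou(pred: Mask, target: Mask) -> float:
    """Intersection over union of two binary masks of equal shape."""
    inter = 0
    union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            if p and t:
                inter += 1
            if p or t:
                union += 1
    # Assumption: two empty masks are treated as perfect agreement.
    return 1.0 if union == 0 else inter / union


def giou(preds: Sequence[Mask], targets: Sequence[Mask]) -> float:
    """Mean of per-image IoUs (assumed gIoU convention, see lead-in)."""
    if len(preds) == 0 or len(preds) != len(targets):
        raise ValueError("preds and targets must be non-empty and equal length")
    scores = [iou(p, t) for p, t in zip(preds, targets)]
    return sum(scores) / len(scores)


# Toy example: the first prediction over-segments, the second is exact.
example_preds = [[[1, 1], [0, 0]], [[1, 0], [0, 0]]]
example_targets = [[[1, 0], [0, 0]], [[1, 0], [0, 0]]]
example_score = giou(example_preds, example_targets)  # (0.5 + 1.0) / 2
```

Under this convention every image contributes equally to the score, regardless of how large its target region is.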

Abstract
Existing segmentation models based on multimodal large language models (MLLMs), such as LISA, often struggle with novel or emerging entities due to their inability to incorporate up-to-date knowledge. To address this challenge, we introduce the Novel Emerging Segmentation Task (NEST), which focuses on segmenting (i) novel entities that MLLMs fail to recognize due to their absence from training data, and (ii) emerging entities that exist within the model's knowledge but demand up-to-date external information for accurate recognition. To support the study of NEST, we construct a NEST benchmark using an automated pipeline that generates news-related data samples for comprehensive evaluation. Additionally, we propose ROSE: Retrieval-Oriented Segmentation Enhancement, a plug-and-play framework designed to augment any MLLM-based segmentation model. ROSE comprises four key components. First, an Internet Retrieval-Augmented Generation module is introduced to employ user-provided multimodal inputs to retrieve real-time web information. Then, a Textual Prompt Enhancer enriches the model with up-to-date information and rich background knowledge, improving the model's perception ability for emerging entities. Furthermore, a Visual Prompt Enhancer is proposed to compensate for MLLMs' lack of exposure to novel entities by leveraging internet-sourced images. To maintain efficiency, a WebSense module is introduced to intelligently decide when to invoke retrieval mechanisms based on user input. Experimental results demonstrate that ROSE significantly boosts performance on the NEST benchmark, outperforming a strong Gemini-2.0 Flash-based retrieval baseline by 19.2 in gIoU.
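The four components described in the abstract can be sketched as a thin wrapper around an existing segmentation model. This is only an illustration of the control flow under stated assumptions: the abstract does not specify interfaces, so every name here (`SegRequest`, `Retrieved`, `RosePipelineSketch`, `enhance_text`, `enhance_visual`, and the stub functions) is hypothetical, and the stubs stand in for real retrieval and a real MLLM-based segmenter.

```python
from dataclasses import dataclass, field, replace
from typing import Callable, List


@dataclass
class Retrieved:
    """Hypothetical container for web results (text and image references)."""
    snippets: List[str] = field(default_factory=list)
    image_refs: List[str] = field(default_factory=list)


@dataclass
class SegRequest:
    """Hypothetical input handed to the wrapped segmentation model."""
    image: str                      # placeholder for image data or a path
    query: str                      # user instruction
    context: str = ""               # filled by the textual prompt enhancer
    reference_images: List[str] = field(default_factory=list)


def enhance_text(req: SegRequest, hits: Retrieved) -> SegRequest:
    # Textual prompt enhancement: attach retrieved background text.
    return replace(req, context="\n".join(hits.snippets))


def enhance_visual(req: SegRequest, hits: Retrieved) -> SegRequest:
    # Visual prompt enhancement: attach retrieved reference images.
    return replace(req, reference_images=list(hits.image_refs))


@dataclass
class RosePipelineSketch:
    segmenter: Callable[[SegRequest], object]    # any wrapped segmenter
    retrieve: Callable[[str, str], Retrieved]    # internet retrieval step
    needs_web: Callable[[str, str], bool]        # WebSense-style gate

    def __call__(self, image: str, query: str) -> object:
        req = SegRequest(image=image, query=query)
        # Retrieval runs only when the gate asks for it, to save cost.
        if self.needs_web(image, query):
            hits = self.retrieve(image, query)
            req = enhance_visual(enhance_text(req, hits), hits)
        return self.segmenter(req)


# Stubs for demonstration only; none of these reflect the paper's models.
retrieval_calls: List[str] = []


def fake_retrieve(image: str, query: str) -> Retrieved:
    retrieval_calls.append(query)
    return Retrieved(snippets=["Entity X was unveiled this week."],
                     image_refs=["web_img_0"])


def fake_gate(image: str, query: str) -> bool:
    # Toy heuristic, an assumption: retrieve when the query says "latest".
    return "latest" in query.lower()


def echo_segmenter(req: SegRequest) -> SegRequest:
    return req  # returns the request so the enhancements can be inspected


pipeline = RosePipelineSketch(echo_segmenter, fake_retrieve, fake_gate)
enhanced = pipeline("img_0", "Segment the latest phone model")
plain = pipeline("img_1", "Segment the dog")
```

Because the segmenter, retriever, and gate are passed in as callables, the wrapped model can be swapped without touching the wrapper, which is one way to read the "plug-and-play" claim.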
Problem

Research questions and friction points this paper is trying to address.

novel entities
emerging entities
segmentation
multimodal large language models
knowledge updating
Innovation

Methods, ideas, or system contributions that make the work stand out.

Retrieval-Augmented Generation
Multimodal Large Language Models
Segmentation Enhancement
Emerging Entity Recognition
Plug-and-Play Framework