Seg-ReSearch: Segmentation with Interleaved Reasoning and External Search

📅 2026-02-04
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
This work addresses the limitations of existing language-guided segmentation methods, which are constrained by the static knowledge embedded in multimodal large language models and struggle with open-world queries requiring dynamic or out-of-domain information. To overcome this, we propose Seg-ReSearch, the first framework to integrate external knowledge retrieval with multi-step reasoning into segmentation tasks, thereby circumventing the model's frozen knowledge bottleneck. Our approach combines a multimodal large language model, a search engine interface, and a hierarchical reinforcement learning reward mechanism, enabling both image and video segmentation. We introduce OK-VOS, the first video object segmentation benchmark that necessitates external knowledge, and demonstrate substantial performance gains over state-of-the-art methods on this benchmark as well as two established datasets, validating the efficacy of our framework in open-world scenarios.

πŸ“ Abstract
Language-guided segmentation has been a popular topic in computer vision. While recent advances in multimodal large language models (MLLMs) have endowed segmentation systems with reasoning capabilities, these efforts remain confined by the frozen internal knowledge of MLLMs, which limits their potential in real-world scenarios that involve up-to-date information or domain-specific concepts. In this work, we propose **Seg-ReSearch**, a novel segmentation paradigm that overcomes the knowledge bottleneck of existing approaches. By enabling interleaved reasoning and external search, Seg-ReSearch empowers segmentation systems to handle dynamic, open-world queries that extend beyond the frozen knowledge of MLLMs. To effectively train this capability, we introduce a hierarchical reward design that harmonizes initial guidance with progressive incentives, mitigating the dilemma between sparse outcome signals and rigid step-wise supervision. For evaluation, we construct OK-VOS, a challenging benchmark that explicitly requires outside knowledge for video object segmentation. Experiments on OK-VOS and two existing reasoning segmentation benchmarks demonstrate that Seg-ReSearch outperforms state-of-the-art approaches by a substantial margin. Code and data will be released at https://github.com/iSEE-Laboratory/Seg-ReSearch.
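The interleaved reasoning-and-search loop described in the abstract can be sketched roughly as follows. This is a hypothetical illustration, not the authors' implementation: the names `mllm_step`, `web_search`, `segment`, and `seg_research` are invented here, and the MLLM, search engine, and mask decoder are replaced with stubs.

```python
def mllm_step(query, context):
    # Stub: a real system would prompt the MLLM with the image/video, the
    # query, and any retrieved evidence, then parse its chosen action.
    if "retrieved" not in context:
        return ("search", f"background facts about: {query}")
    return ("segment", query)

def web_search(search_query):
    # Stub: a real system would call an external search engine here.
    return {"retrieved": f"evidence for '{search_query}'"}

def segment(query, context):
    # Stub: a real system would run a mask decoder conditioned on the
    # reasoning trace; here we return a placeholder result instead.
    return {"mask_for": query, "evidence": context.get("retrieved")}

def seg_research(query, max_steps=4):
    """Interleave reasoning steps with external search until the model
    decides it has enough knowledge to emit a segmentation."""
    context = {}
    for _ in range(max_steps):
        action, payload = mllm_step(query, context)
        if action == "search":
            context.update(web_search(payload))  # inject external knowledge
        else:
            return segment(payload, context)
    return segment(query, context)  # fall back once the step budget is spent
```

The point of the sketch is the control flow: the model is free to alternate between reasoning/search steps and only commits to a mask once retrieved evidence is in context, which is what lets the system answer queries outside the MLLM's frozen knowledge.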
Problem

Research questions and friction points this paper is trying to address.

language-based segmentation
multimodal large language models
knowledge bottleneck
open-world queries
external knowledge
Innovation

Methods, ideas, or system contributions that make the work stand out.

interleaved reasoning
external search
segmentation
multimodal large language models
hierarchical reward