AI Summary
This work addresses the limitations of existing language-guided segmentation methods, which are constrained by the static knowledge embedded in multimodal large language models and struggle with open-world queries that require dynamic or out-of-domain information. To overcome this, we propose Seg-ReSearch, the first framework to integrate external knowledge retrieval and multi-step reasoning into segmentation tasks, thereby circumventing the model's frozen-knowledge bottleneck. Our approach combines a multimodal large language model, a search engine interface, and a hierarchical reinforcement learning reward mechanism, enabling both image and video segmentation. We introduce OK-VOS, the first video object segmentation benchmark that necessitates external knowledge, and demonstrate substantial performance gains over state-of-the-art methods on this benchmark as well as on two established datasets, validating the efficacy of our framework in open-world scenarios.
Abstract
Language-guided segmentation has been a popular topic in computer vision. While recent advances in multimodal large language models (MLLMs) have endowed segmentation systems with reasoning capabilities, these efforts remain confined by the frozen internal knowledge of MLLMs, which limits their potential in real-world scenarios involving up-to-date information or domain-specific concepts. In this work, we propose **Seg-ReSearch**, a novel segmentation paradigm that overcomes the knowledge bottleneck of existing approaches. By enabling interleaved reasoning and external search, Seg-ReSearch empowers segmentation systems to handle dynamic, open-world queries that extend beyond the frozen knowledge of MLLMs. To effectively train this capability, we introduce a hierarchical reward design that harmonizes initial guidance with progressive incentives, mitigating the dilemma between sparse outcome signals and rigid step-wise supervision. For evaluation, we construct OK-VOS, a challenging benchmark that explicitly requires outside knowledge for video object segmentation. Experiments on OK-VOS and two existing reasoning segmentation benchmarks demonstrate that Seg-ReSearch outperforms state-of-the-art approaches by a substantial margin. Code and data will be released at https://github.com/iSEE-Laboratory/Seg-ReSearch.