DeRIS: Decoupling Perception and Cognition for Enhanced Referring Image Segmentation through Loopback Synergy

📅 2025-07-02
📈 Citations: 0 (influential: 0)
🤖 AI Summary
This paper addresses two bottlenecks in referring image segmentation (RIS): insufficient multimodal cognitive capability and the long-tailed distribution of target existence judgement. It proposes DeRIS, a framework that decouples perception from cognition. The method has three parts: (1) it explicitly separates visual perception from language-driven cognitive reasoning, which lets the authors identify weak cognition, rather than weak perception, as the primary performance bottleneck in existing models; (2) it introduces a Loopback Synergy mechanism to strengthen dynamic, bidirectional interaction between the two modules; (3) it adds a data augmentation strategy that converts samples into non-referent ones, improving robustness on “no-target” expressions. Evaluated on benchmarks including RefCOCO, the approach is reported to significantly improve segmentation accuracy across single-referent, non-referent, and multi-referent scenarios within one unified model, and it handles these cases without architectural modifications.

📝 Abstract
Referring Image Segmentation (RIS) is a challenging task that aims to segment objects in an image based on natural language expressions. While prior studies have predominantly concentrated on improving vision-language interactions and achieving fine-grained localization, the fundamental bottlenecks in existing RIS frameworks have not been systematically analyzed. To bridge this gap, we propose DeRIS, a novel framework that decomposes RIS into two key components: perception and cognition. This modular decomposition facilitates a systematic analysis of the primary bottlenecks impeding RIS performance. Our findings reveal that the predominant limitation lies not in perceptual deficiencies, but in the insufficient multi-modal cognitive capacity of current models. To mitigate this, we propose a Loopback Synergy mechanism, which enhances the synergy between the perception and cognition modules, thereby enabling precise segmentation while also improving image-text comprehension. Additionally, we analyze the long-tail distribution issue related to target existence judgement in general scenarios and introduce a simple non-referent sample conversion data augmentation to address it. Notably, DeRIS adapts to both non-referent and multi-referent scenarios without requiring specialized architectural modifications, enhancing its general applicability. The code and models are available at https://github.com/Dmmm1997/DeRIS.
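
The abstract does not spell out how the perception and cognition outputs are combined, so the following is only a minimal sketch of one plausible reading, not the authors' implementation. It assumes a perception branch that yields class-agnostic candidate masks and a cognition branch that yields one text-relevance score per candidate. The function name `combine_masks`, the `threshold` parameter, and the list-of-lists mask format are all hypothetical, and the Loopback Synergy mechanism itself is not modeled. The sketch shows why such a split could cover non-referent and multi-referent expressions without extra heads: the final mask is the union of every candidate that clears the threshold, which may be zero, one, or several.

```python
from typing import List

Mask = List[List[int]]  # binary mask stored as rows of 0/1 (illustrative format)


def combine_masks(masks: List[Mask], scores: List[float],
                  threshold: float = 0.5) -> Mask:
    """Union of the candidate masks whose relevance score meets the threshold.

    masks  : candidate instance masks from a hypothetical perception branch
    scores : per-candidate text-relevance scores from a hypothetical
             cognition branch (same order and length as `masks`)
    """
    if not masks:
        raise ValueError("need at least one candidate mask")
    if len(masks) != len(scores):
        raise ValueError("masks and scores must have the same length")

    height, width = len(masks[0]), len(masks[0][0])
    combined = [[0] * width for _ in range(height)]

    for mask, score in zip(masks, scores):
        if score < threshold:
            continue  # cognition judged this candidate irrelevant to the text
        for y in range(height):
            for x in range(width):
                if mask[y][x]:
                    combined[y][x] = 1
    return combined


if __name__ == "__main__":
    a = [[1, 0], [0, 0]]
    b = [[0, 1], [0, 0]]
    c = [[0, 0], [1, 1]]
    # Two candidates pass: a multi-referent result.
    print(combine_masks([a, b, c], [0.9, 0.1, 0.8]))
    # No candidate passes: an empty mask, i.e. a non-referent result.
    print(combine_masks([a, b, c], [0.1, 0.2, 0.3]))
```

Under this reading, segmentation quality depends on the perception branch proposing good candidates, while correctness of the final answer depends on the cognition scores, which is the separation the abstract uses to locate the bottleneck.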

Problem

Research questions and friction points this paper is trying to address.

Analyzing bottlenecks in Referring Image Segmentation frameworks
Enhancing multi-modal cognitive capacity for better segmentation
Addressing long-tail distribution in target existence judgement

Innovation

Methods, ideas, or system contributions that make the work stand out.

Decouples RIS into separate perception and cognition modules
Uses Loopback Synergy to strengthen interaction between the two modules
Introduces non-referent sample conversion as a data augmentation
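
The summary does not describe how referring samples are converted into non-referent ones. As a hedged illustration only, the sketch below assumes the simplest possible conversion: pair an image with an expression borrowed from a different image and label the result as having no target. The `Sample` class, `to_non_referent`, `augment`, and the `ratio` parameter are hypothetical names introduced for this example and do not come from the paper or its repository. A borrowed expression can occasionally match something in the new image by accident, so a real pipeline would likely need an extra check that this sketch omits.

```python
import random
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Sample:
    image_id: str
    expression: str
    mask: Optional[list]      # None stands for an empty (no-target) mask
    has_target: bool = True


def to_non_referent(sample: Sample, pool: List[Sample],
                    rng: random.Random) -> Optional[Sample]:
    """Build a no-target sample by borrowing an expression from another image."""
    donors = [s for s in pool
              if s.image_id != sample.image_id and s.has_target]
    if not donors:
        return None  # nothing to borrow from
    donor = rng.choice(donors)
    return Sample(image_id=sample.image_id, expression=donor.expression,
                  mask=None, has_target=False)


def augment(samples: List[Sample], ratio: float, seed: int = 0) -> List[Sample]:
    """Return the originals plus converted no-target samples.

    Each referring sample is converted with probability `ratio`.
    """
    rng = random.Random(seed)
    augmented = list(samples)
    for sample in samples:
        if sample.has_target and rng.random() < ratio:
            negative = to_non_referent(sample, samples, rng)
            if negative is not None:
                augmented.append(negative)
    return augmented


if __name__ == "__main__":
    data = [
        Sample("img_a", "the dog on the left", mask=[[1, 0]]),
        Sample("img_b", "a red umbrella", mask=[[0, 1]]),
    ]
    for item in augment(data, ratio=1.0):
        print(item.image_id, "|", item.expression, "|", item.has_target)
```

Raising `ratio` increases the share of no-target examples seen during training, which is one way to counter a long-tailed distribution in which such examples are rare.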