🤖 AI Summary
To address the scarcity of visible-light cues and the performance limitations of unimodal methods in camouflaged object segmentation (COS), this paper proposes UniCOS—the first framework to employ state-space models (SSMs) for dynamic cross-modal feature modeling and fusion. Its key contributions are: (1) a state-space-driven cross-modal fusion mechanism with feedback architecture; (2) the UniLearner module, which leverages non-COS multimodal data (e.g., infrared, depth) to synthesize pseudo-modal content and establish semantic correspondences, enabling label-free knowledge transfer; and (3) joint semantic alignment learning and multimodal knowledge distillation. Evaluated on both real and pseudo-multimodal COS benchmarks, UniCOS achieves significant improvements over state-of-the-art methods—gaining +4.2% mIoU using only off-the-shelf non-COS multimodal data—demonstrating its effectiveness in bridging the modality gap without requiring COS-specific annotations.
📝 Abstract
Camouflaged Object Segmentation (COS) remains a challenging problem due to the subtle visual differences between camouflaged objects and their backgrounds. Owing to the exceedingly limited visual cues available in the visible spectrum, previous RGB single-modality approaches often struggle to achieve satisfactory results, prompting the exploration of multimodal data to enhance detection accuracy. In this work, we present UniCOS, a novel framework that effectively leverages diverse data modalities to improve segmentation performance. UniCOS comprises two key components: a multimodal segmentor, UniSEG, and a cross-modal knowledge learning module, UniLearner. UniSEG employs a state space fusion mechanism to integrate cross-modal features within a unified state space, enhancing contextual understanding and improving robustness when integrating heterogeneous data. It also includes a fusion-feedback mechanism that facilitates feature extraction. UniLearner exploits multimodal data unrelated to the COS task to improve the segmentation ability of COS models by generating pseudo-modal content and cross-modal semantic associations. Extensive experiments demonstrate that UniSEG outperforms existing Multimodal COS (MCOS) segmentors, regardless of whether real or pseudo-multimodal COS data is available. Moreover, in scenarios where multimodal COS data is unavailable but multimodal non-COS data is accessible, UniLearner effectively exploits these data to enhance segmentation performance. Our code will be made publicly available on [GitHub](https://github.com/cnyvfang/UniCOS).
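To make the "unified state space" idea concrete, here is a minimal, illustrative sketch of how cross-modal features can be fused through a single linear state-space recurrence: tokens from the RGB stream and an auxiliary stream (e.g. depth or infrared) are interleaved into one sequence and scanned with shared state, so the hidden state carries context across modalities. This is not the paper's actual UniSEG implementation — the function names (`ssm_scan`, `fuse_modalities`) and the fixed matrices `A`, `B`, `C` are simplifying assumptions for illustration; the real model uses learned, input-dependent SSM parameters.

```python
import numpy as np

def ssm_scan(x, A, B, C):
    """Linear state-space recurrence over a token sequence.

    h_t = A @ h_{t-1} + B @ x_t ;  y_t = C @ h_t
    x: (T, d_in) -> y: (T, d_out)
    """
    h = np.zeros(A.shape[0])
    ys = []
    for x_t in x:
        h = A @ h + B @ x_t   # state update mixes current token into running state
        ys.append(C @ h)      # readout from the shared state
    return np.stack(ys)

def fuse_modalities(rgb_tokens, aux_tokens, A, B, C):
    """Interleave RGB and auxiliary-modality tokens into one sequence,
    scan it with a shared SSM (so the state is cross-modal), then split
    the outputs back into per-modality streams."""
    T, d = rgb_tokens.shape
    interleaved = np.empty((2 * T, d))
    interleaved[0::2] = rgb_tokens   # even positions: RGB tokens
    interleaved[1::2] = aux_tokens   # odd positions: auxiliary tokens
    y = ssm_scan(interleaved, A, B, C)
    return y[0::2], y[1::2]          # fused RGB stream, fused auxiliary stream

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    T, d_in, d_state, d_out = 6, 4, 8, 4
    A = 0.9 * np.eye(d_state)                       # stable, decaying state
    B = 0.1 * rng.standard_normal((d_state, d_in))
    C = 0.1 * rng.standard_normal((d_out, d_state))
    rgb = rng.standard_normal((T, d_in))
    aux = rng.standard_normal((T, d_in))
    fused_rgb, fused_aux = fuse_modalities(rgb, aux, A, B, C)
    print(fused_rgb.shape, fused_aux.shape)  # (6, 4) (6, 4)
```

Because every output token is read from a state that has already absorbed tokens of both modalities, each modality's fused features are conditioned on the other — the core intuition behind fusing in a shared state space rather than by simple feature concatenation.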