🤖 AI Summary
To address the limitations of single- or dual-modal guidance and the insufficient modeling of complex semantics in few-shot segmentation (FSS), this paper proposes Decompose, Fuse and Reconstruct (DFR), a dynamic multimodal fusion framework. DFR introduces a novel tri-modal (vision, text, audio) decompose-fuse-reconstruct mechanism: it leverages SAM to generate visual region proposals, combines hierarchical textual semantic expansion with audio feature extraction, employs a contrastive cross-modal fusion module for semantic alignment, and adopts a dual-path reconstruction architecture that jointly exploits geometric and semantic consistency. Evaluated on both synthetic and real-world benchmarks, DFR significantly outperforms state-of-the-art methods. The results indicate that dynamic multimodal interaction is critical to the robustness and generalization of few-shot segmentation, particularly when labeled support samples are scarce.
📝 Abstract
This paper presents DFR (Decompose, Fuse and Reconstruct), a novel framework that addresses the fundamental challenge of effectively utilizing multi-modal guidance in few-shot segmentation (FSS). While existing approaches primarily rely on visual support samples or textual descriptions, their single or dual-modal paradigms limit exploitation of rich perceptual information available in real-world scenarios. To overcome this limitation, the proposed approach leverages the Segment Anything Model (SAM) to systematically integrate visual, textual, and audio modalities for enhanced semantic understanding. The DFR framework introduces three key innovations: 1) Multi-modal Decompose: a hierarchical decomposition scheme that extracts visual region proposals via SAM, expands textual semantics into fine-grained descriptors, and processes audio features for contextual enrichment; 2) Multi-modal Contrastive Fuse: a fusion strategy employing contrastive learning to maintain consistency across visual, textual, and audio modalities while enabling dynamic semantic interactions between foreground and background features; 3) Dual-path Reconstruct: an adaptive integration mechanism combining semantic guidance from tri-modal fused tokens with geometric cues from multi-modal location priors. Extensive experiments across visual, textual, and audio modalities under both synthetic and real settings demonstrate DFR's substantial performance improvements over state-of-the-art methods.
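The contrastive fusion and dual-path reconstruction steps described above can be illustrated with a minimal sketch. This is not the paper's implementation: the symmetric InfoNCE-style loss, the averaging fusion, and the `alpha`-weighted combination of semantic similarity with a location prior are all simplifying assumptions standing in for the learned modules DFR actually trains.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    """Unit-normalize vectors along the given axis."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def _log_softmax(logits):
    """Numerically stable log-softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def contrastive_fuse(vis, txt, aud, temperature=0.07):
    """Toy tri-modal contrastive fusion (hypothetical stand-in for DFR's module).

    Aligns three (N, D) modality token sets with a pairwise InfoNCE-style
    loss (matched rows are positives), then fuses them by simple averaging.
    Returns the fused tokens and the mean alignment loss.
    """
    vis, txt, aud = (l2_normalize(m) for m in (vis, txt, aud))

    def info_nce(a, b):
        logits = (a @ b.T) / temperature          # (N, N) cosine similarities
        log_prob = _log_softmax(logits)           # row-wise log-probabilities
        idx = np.arange(len(a))
        return -log_prob[idx, idx].mean()         # diagonal = matched pairs

    loss = (info_nce(vis, txt) + info_nce(vis, aud) + info_nce(txt, aud)) / 3.0
    fused = l2_normalize((vis + txt + aud) / 3.0)
    return fused, loss

def dual_path_reconstruct(pixel_feats, fused_token, location_prior, alpha=0.5):
    """Toy dual-path scoring: blend semantic similarity between per-pixel
    features and a fused tri-modal token with a geometric location prior.
    `alpha` (assumed fixed here) weights the semantic path vs. the prior."""
    sem = l2_normalize(pixel_feats) @ l2_normalize(fused_token)  # (P,)
    return alpha * sem + (1.0 - alpha) * location_prior
```

Here the averaging fusion keeps the sketch readable; in practice a framework like DFR would learn the fusion (e.g., via cross-attention) rather than average, and the location prior would come from SAM-guided proposals rather than be supplied directly.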