🤖 AI Summary
Existing multimodal stance detection methods struggle to distinguish modality-specific signals from cross-modal shared evidence, limiting performance gains. This work proposes DiME, a novel architecture that explicitly decouples stance-related information into three components: text-dominant, vision-dominant, and cross-modal shared representations. DiME uses target-aware chain-of-thought prompting to produce reasoning-guided textual input for its dual encoders and introduces dedicated loss functions to separately optimize the modality-specific and shared components. By combining contrastive learning, cosine alignment, and a gated fusion mechanism, DiME adaptively fuses the decomposed signals for prediction. Extensive experiments on four benchmark datasets demonstrate that DiME significantly outperforms current unimodal and multimodal approaches under both in-target and zero-shot settings.
📝 Abstract
Multi-modal stance detection (MSD) aims to determine an author's stance toward a given target using both textual and visual content. While recent methods leverage multi-modal fusion and prompt-based learning, most fail to distinguish between modality-specific signals and cross-modal evidence, leading to suboptimal performance. We propose DiME (Disentangled Multi-modal Experts), a novel architecture that explicitly separates stance information into textual-dominant, visual-dominant, and cross-modal shared components. DiME first uses a target-aware Chain-of-Thought prompt to generate reasoning-guided textual input. Then, dual encoders extract modality features, which are processed by three expert modules with specialized loss functions: contrastive learning for modality-specific experts and cosine alignment for shared representation learning. A gating network adaptively fuses expert outputs for final prediction. Experiments on four benchmark datasets show that DiME consistently outperforms strong unimodal and multi-modal baselines under both in-target and zero-shot settings.
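To make the fusion step concrete, below is a minimal, framework-free sketch of two mechanisms the abstract names: cosine alignment for the shared representation and a softmax gate that fuses the three expert outputs. The fixed gate scores and toy 4-d vectors are illustrative assumptions, not the paper's implementation; in DiME the gate scores would come from a learned gating network over the expert representations.

```python
import math

def cosine_alignment_loss(u, v):
    # 1 - cosine similarity: drives the text- and vision-side shared
    # representations toward the same direction (illustrative form).
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (nu * nv)

def softmax(scores):
    # Numerically stable softmax over per-expert gate scores.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def gated_fusion(experts, gate_scores):
    """Fuse expert representations as a convex combination.

    experts: equal-length vectors for the text-dominant,
        vision-dominant, and cross-modal shared experts.
    gate_scores: one scalar per expert (hypothetical fixed values
        here; learned from the inputs in the actual model).
    """
    weights = softmax(gate_scores)
    dim = len(experts[0])
    fused = [sum(w * e[i] for w, e in zip(weights, experts))
             for i in range(dim)]
    return fused, weights

# Toy 4-d expert outputs.
text_exp   = [1.0, 0.0, 0.5, 0.2]
vision_exp = [0.0, 1.0, 0.5, 0.8]
shared_exp = [0.5, 0.5, 0.5, 0.5]

fused, weights = gated_fusion(
    [text_exp, vision_exp, shared_exp],
    gate_scores=[2.0, 1.0, 1.5],  # text weighted highest here
)
```

The gate outputs sum to one, so the fused vector stays a convex combination of the expert representations; a sample whose stance cues are mostly textual would receive a larger text-expert weight, which is the adaptivity the abstract refers to.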