Rethinking Fusion: Disentangled Learning of Shared and Modality-Specific Information for Stance Detection

📅 2026-01-29
🤖 AI Summary
Existing multimodal stance detection methods struggle to distinguish modality-specific signals from cross-modal shared evidence, limiting performance gains. This work proposes DiME, a novel architecture that explicitly decouples stance-related information into three components: text-dominant, vision-dominant, and cross-modal shared representations. DiME employs target-aware chain-of-thought prompting to guide dual encoders in extracting these representations and introduces dedicated loss functions to separately optimize the modality-specific and shared components. By combining contrastive learning, cosine alignment, and a gated fusion mechanism, DiME adaptively fuses the decomposed signals. Extensive experiments on four benchmark datasets demonstrate that DiME significantly outperforms current unimodal and multimodal approaches under both in-target and zero-shot settings.

📝 Abstract
Multi-modal stance detection (MSD) aims to determine an author's stance toward a given target using both textual and visual content. While recent methods leverage multi-modal fusion and prompt-based learning, most fail to distinguish between modality-specific signals and cross-modal evidence, leading to suboptimal performance. We propose DiME (Disentangled Multi-modal Experts), a novel architecture that explicitly separates stance information into textual-dominant, visual-dominant, and cross-modal shared components. DiME first uses a target-aware Chain-of-Thought prompt to generate reasoning-guided textual input. Then, dual encoders extract modality features, which are processed by three expert modules with specialized loss functions: contrastive learning for modality-specific experts and cosine alignment for shared representation learning. A gating network adaptively fuses expert outputs for final prediction. Experiments on four benchmark datasets show that DiME consistently outperforms strong unimodal and multi-modal baselines under both in-target and zero-shot settings.
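The expert-fusion step described in the abstract can be sketched as follows. This is a minimal NumPy illustration of gating over three disentangled expert outputs, not the authors' implementation: the feature dimension, the single-layer gate, and all variable names (`h_text`, `h_vision`, `h_shared`, `W_gate`) are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    # Numerically stable softmax over the last axis.
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def cosine(a, b):
    # Cosine similarity, as used by the shared-representation alignment loss.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical sizes: d-dimensional features, three experts.
d, num_experts = 16, 3

# Stand-ins for the three expert outputs the paper describes:
# text-dominant, vision-dominant, and cross-modal shared representations.
h_text   = rng.normal(size=d)
h_vision = rng.normal(size=d)
h_shared = rng.normal(size=d)
experts = np.stack([h_text, h_vision, h_shared])     # shape (3, d)

# Gating network, reduced here to one linear layer: it maps the
# concatenated expert outputs to a softmax weight per expert.
W_gate = rng.normal(size=(num_experts, num_experts * d)) * 0.1
gate = softmax(W_gate @ experts.reshape(-1))         # shape (3,), sums to 1

# Adaptive fusion: gate-weighted sum of the expert representations,
# which would feed the final stance classifier.
fused = gate @ experts                               # shape (d,)
```

A full model would learn `W_gate` jointly with the encoders, apply a contrastive loss to the two modality-specific experts, and a cosine-alignment loss (via `cosine`) to pull the shared representations from the text and vision encoders together.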
Problem

Research questions and friction points this paper is trying to address.

multi-modal stance detection
modality-specific information
cross-modal evidence
information disentanglement
Innovation

Methods, ideas, or system contributions that make the work stand out.

disentangled representation
multi-modal fusion
contrastive learning
Chain-of-Thought prompting
stance detection
Zhiyu Xie
School of Artificial Intelligence, Shenzhen Technology University, Shenzhen, China
Fuqiang Niu
School of Cyber Science and Technology, University of Science and Technology of China, Hefei, China
Genan Dai
Shenzhen Technology University
Spatio-temporal Data Mining
Qianlong Wang
Harbin Institute of Technology, Shenzhen
Natural Language Processing · Multimodal
Li Dong
School of Artificial Intelligence, Shenzhen Technology University, Shenzhen, China
Bowen Zhang
Shenzhen Technology University
sentiment analysis · stance detection · social computing
Hu Huang
University of Science and Technology of China
Social Computing · Stance Detection