InconVAD: A Two-Stage Dual-Tower Framework for Multimodal Emotion Inconsistency Detection

📅 2025-09-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
Cross-modal sentiment inconsistency detection faces three key challenges: incomplete unimodal representations, unconditional fusion that degrades discriminative power, and the absence of explicit modeling of inconsistency itself. To address these, we propose a two-stage dual-tower framework grounded in the continuous Valence-Arousal-Dominance (VAD) emotion space. In Stage I, uncertainty-aware unimodal encoders independently model probabilistic emotion distributions for speech and text. In Stage II, a conditional fusion module explicitly predicts the cross-modal sentiment consistency state. Crucially, and unlike prior work, our approach treats inconsistency detection as the primary task, enhancing both discrimination accuracy and interpretability. Evaluated on multiple benchmarks, our method significantly outperforms state-of-the-art approaches, achieving substantial gains in inconsistency identification accuracy alongside improvements in multimodal sentiment classification. These results demonstrate the effectiveness and robustness of the proposed framework.

📝 Abstract
Detecting emotional inconsistency across modalities is a key challenge in affective computing, as speech and text often convey conflicting cues. Existing approaches generally rely on incomplete emotion representations and employ unconditional fusion, which weakens performance when modalities are inconsistent. Moreover, little prior work explicitly addresses inconsistency detection itself. We propose InconVAD, a two-stage framework grounded in the Valence/Arousal/Dominance (VAD) space. In the first stage, independent uncertainty-aware models yield robust unimodal predictions. In the second stage, a classifier identifies cross-modal inconsistency and selectively integrates consistent signals. Extensive experiments show that InconVAD surpasses existing methods in both multimodal emotion inconsistency detection and modeling, offering a more reliable and interpretable solution for emotion analysis.
Problem

Research questions and friction points this paper is trying to address.

Detecting emotional inconsistency across speech and text modalities
Addressing the limitations of incomplete emotion representations and unconditional fusion
Providing a reliable, interpretable solution for multimodal emotion analysis
Innovation

Methods, ideas, or system contributions that make the work stand out.

Two-stage dual-tower framework for multimodal emotion inconsistency detection
Independent uncertainty-aware unimodal prediction models
Consistency classifier that identifies cross-modal inconsistency and selectively integrates consistent signals
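The innovations above can be sketched end to end. The following is a minimal, hypothetical illustration of the two-stage idea only: the actual InconVAD encoders, consistency test, and fusion rule are not specified in this summary, so the uncertainty-normalized distance, the threshold value, and the precision-weighted fusion below are assumptions for exposition, not the authors' method.

```python
import math
from dataclasses import dataclass

@dataclass
class VADPrediction:
    """Stage I output: a unimodal VAD estimate with per-dimension uncertainty."""
    mean: tuple  # (valence, arousal, dominance)
    std: tuple   # predictive standard deviation per dimension

def uncertainty_gap(a: VADPrediction, b: VADPrediction) -> float:
    """Distance between two unimodal predictions, normalized by their
    combined uncertainty (an assumed stand-in for a learned classifier)."""
    return math.sqrt(sum(
        (ma - mb) ** 2 / (sa ** 2 + sb ** 2)
        for ma, mb, sa, sb in zip(a.mean, b.mean, a.std, b.std)
    ))

def detect_and_fuse(speech: VADPrediction, text: VADPrediction,
                    threshold: float = 2.0):
    """Stage II sketch: flag inconsistency, fuse only when consistent.

    Returns (inconsistent: bool, fused VAD mean tuple).
    """
    if uncertainty_gap(speech, text) > threshold:
        # Inconsistent: skip fusion and fall back to the more certain modality.
        pick = speech if sum(speech.std) < sum(text.std) else text
        return True, pick.mean
    # Consistent: precision-weighted (inverse-variance) fusion per dimension.
    fused = tuple(
        (ms / ss ** 2 + mt / st ** 2) / (1 / ss ** 2 + 1 / st ** 2)
        for ms, mt, ss, st in zip(speech.mean, text.mean, speech.std, text.std)
    )
    return False, fused

# Example: sarcastic speech (positive voice, negative words) is flagged,
# while agreeing modalities are fused.
speech = VADPrediction((0.6, 0.4, 0.1), (0.1, 0.1, 0.1))
text = VADPrediction((-0.5, 0.3, 0.0), (0.1, 0.1, 0.1))
print(detect_and_fuse(speech, text)[0])  # True: modalities conflict
```

The conditional structure is the point: fusion happens only after the consistency state is decided, so conflicting signals are never averaged away.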
Junchuan Zhao
School of Computing, National University of Singapore, Singapore
Francis Bu Sung Lee
College of Computing and Data Science, Nanyang Technological University, Singapore
Andrew Zi Han Yee
Wee Kim Wee School of Communication and Information, Nanyang Technological University, Singapore