Tracing Intricate Cues in Dialogue: Joint Graph Structure and Sentiment Dynamics for Multimodal Emotion Recognition

📅 2024-07-31

🏛️ arXiv.org

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Multimodal Emotion Recognition in Conversations (MERC) faces three key challenges: insufficient exploitation of cross-modal cues, conflicts that arise when same-modality and cross-modality information is fused within the same network layer, and difficulty modeling abrupt sentiment shifts. To address these, the paper proposes GraphSmile, a framework with two complementary modules: Hierarchical Alternating Graph-structured Fusion (GSF) and Explicit Sentiment Dynamics Prediction (SDP). GSF separates intra-modal and cross-modal modeling by alternating between them layer by layer over graph structures, while SDP is an auxiliary task that explicitly captures sentiment dynamics between utterances. The same framework also applies to Multimodal Sentiment Analysis in Conversations (MSAC), giving a unified model for both MERC and MSAC. Experiments on multiple benchmark datasets show that GraphSmile significantly outperforms baseline models on complex emotional and sentimental patterns.

📝 Abstract

Multimodal emotion recognition in conversation (MERC) has garnered substantial research attention recently. Existing MERC methods face several challenges: (1) they fail to fully harness direct inter-modal cues, possibly leading to less-than-thorough cross-modal modeling; (2) they concurrently extract information from the same and different modalities at each network layer, potentially triggering conflicts from the fusion of multi-source data; (3) they lack the agility required to detect dynamic sentimental changes, perhaps resulting in inaccurate classification of utterances with abrupt sentiment shifts. To address these issues, a novel approach named GraphSmile is proposed for tracking intricate emotional cues in multimodal dialogues. GraphSmile comprises two key components, i.e., GSF and SDP modules. GSF ingeniously leverages graph structures to alternately assimilate inter-modal and intra-modal emotional dependencies layer by layer, adequately capturing cross-modal cues while effectively circumventing fusion conflicts. SDP is an auxiliary task to explicitly delineate the sentiment dynamics between utterances, promoting the model's ability to distinguish sentimental discrepancies. Furthermore, GraphSmile is effortlessly applied to multimodal sentiment analysis in conversation (MSAC), forging a unified multimodal affective model capable of executing MERC and MSAC tasks. Empirical results on multiple benchmarks demonstrate that GraphSmile can handle complex emotional and sentimental patterns, significantly outperforming baseline models.
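The layer-by-layer alternation described for GSF can be illustrated with a minimal sketch. This is not the authors' implementation: the node layout (one node per modality and utterance), the context `window`, the residual mean aggregation, and every function name below are assumptions chosen for illustration, and no learned weights are included.

```python
def build_neighbors(num_utts, num_mods, window=2):
    """Build two edge sets over nodes indexed as (modality, utterance).

    inter: same utterance, different modality (cross-modal edges).
    intra: nearby utterances, same modality (context edges).
    Both edge definitions are illustrative assumptions.
    """
    inter, intra = {}, {}
    for m in range(num_mods):
        for i in range(num_utts):
            inter[(m, i)] = [(m2, i) for m2 in range(num_mods) if m2 != m]
            lo, hi = max(0, i - window), min(num_utts, i + window + 1)
            intra[(m, i)] = [(m, j) for j in range(lo, hi) if j != i]
    return inter, intra


def propagate(feats, neighbors):
    """One residual step: each node adds the mean of its neighbors' features."""
    out = {}
    for node, h in feats.items():
        nbrs = neighbors[node]
        if not nbrs:
            # Isolated node: keep its features unchanged.
            out[node] = list(h)
            continue
        mean = [sum(feats[n][k] for n in nbrs) / len(nbrs) for k in range(len(h))]
        out[node] = [a + b for a, b in zip(h, mean)]
    return out


def alternating_fusion(feats, inter, intra, num_layers=2):
    """Alternate inter-modal and intra-modal propagation, one after the other."""
    h = feats
    for _ in range(num_layers):
        h = propagate(h, inter)  # cross-modal cues only
        h = propagate(h, intra)  # same-modality context only
    return h


inter, intra = build_neighbors(num_utts=3, num_mods=3, window=1)
feats = {node: [1.0, 1.0] for node in inter}
fused = alternating_fusion(feats, inter, intra, num_layers=1)
```

Because each step touches only one edge type, same-modality and cross-modality information is never mixed within a single aggregation, which is the property the abstract attributes to the alternating design.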

Problem

Research questions and friction points this paper is trying to address.

Inadequate cross-modal modeling in emotion recognition

Conflicts from multi-source data fusion

Difficulty detecting dynamic sentiment changes

Innovation

Methods, ideas, or system contributions that make the work stand out.

Leverages graph structures for inter-modal dependencies

Alternates intra-modal and inter-modal assimilation layers

Explicitly models sentiment dynamics between utterances
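One way to picture explicit sentiment dynamics between utterances is as a binary "shift" label for each adjacent pair. The sketch below is an assumption-laden illustration, not the paper's label construction: the emotion-to-polarity mapping `POLARITY`, the restriction to adjacent pairs, and the function name are all hypothetical.

```python
# Hypothetical mapping from emotion labels to coarse sentiment polarity.
POLARITY = {
    "joy": "positive",
    "neutral": "neutral",
    "anger": "negative",
    "sadness": "negative",
}


def sentiment_shift_labels(emotions, polarity=POLARITY):
    """Return 1 where sentiment changes between adjacent utterances, else 0."""
    sentiments = [polarity[e] for e in emotions]
    # Compare each utterance with the one that follows it.
    return [int(a != b) for a, b in zip(sentiments, sentiments[1:])]


labels = sentiment_shift_labels(["joy", "joy", "anger", "sadness", "neutral"])
# labels == [0, 1, 0, 1]
```

Labels of this kind could supervise an auxiliary objective alongside the main emotion classification loss; how the actual SDP task is defined and weighted is not specified on this page.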

🔎 Similar Papers

Multimodal Fusion via Hypergraph Autoencoder and Contrastive Learning for Emotion Recognition in Conversation