Multimodal Emotion Recognition and Sentiment Analysis in Multi-Party Conversation Contexts

📅 2025-03-09
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses multimodal sentiment recognition and sentiment polarity analysis in realistic multi-party conversational scenarios. To overcome insufficient modeling of dynamic cross-modal coupling during multi-speaker interactions, we propose— for the first time—a four-modal (text, speech, facial, and video) collaborative modeling framework: RoBERTa, Wav2Vec 2.0, a custom lightweight FacialNet, and an end-to-end CNN-Transformer video encoder are employed for modality-specific feature extraction; features are then fused and jointly classified. Our core innovations include cross-modal temporal alignment modeling and a dialogue-oriented lightweight visual representation design. Evaluated on standard multi-party dialogue benchmarks, our method achieves 66.36% accuracy for emotion recognition and 72.15% for sentiment analysis—significantly outperforming all unimodal baselines—demonstrating the effectiveness of multimodal collaborative modeling.

Technology Category

Application Category

📝 Abstract
Emotion recognition and sentiment analysis are pivotal tasks in speech and language processing, particularly in real-world scenarios involving multi-party, conversational data. This paper presents a multimodal approach to tackle these challenges on a well-known dataset. We propose a system that integrates four key modalities/channels using pre-trained models: RoBERTa for text, Wav2Vec2 for speech, a proposed FacialNet for facial expressions, and a CNN+Transformer architecture trained from scratch for video analysis. Feature embeddings from each modality are concatenated to form a multimodal vector, which is then used to predict emotion and sentiment labels. The multimodal system demonstrates superior performance compared to unimodal approaches, achieving an accuracy of 66.36% for emotion recognition and 72.15% for sentiment analysis.
Problem

Research questions and friction points this paper is trying to address.

Multimodal emotion recognition in multi-party conversations.
Sentiment analysis using integrated text, speech, facial, and video data.
Improving accuracy over unimodal approaches in emotion and sentiment prediction.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Integrates RoBERTa, Wav2Vec2, FacialNet, CNN+Transformer
Concatenates multimodal feature embeddings for predictions
Achieves 66.36% emotion, 72.15% sentiment accuracy