🤖 AI Summary
This work addresses multimodal emotion recognition and sentiment polarity analysis in realistic multi-party conversational scenarios. To overcome insufficient modeling of dynamic cross-modal coupling during multi-speaker interactions, we propose, to the best of our knowledge for the first time, a four-modal (text, speech, facial, and video) collaborative modeling framework: RoBERTa, Wav2Vec 2.0, a custom lightweight FacialNet, and an end-to-end CNN-Transformer video encoder are employed for modality-specific feature extraction, and the resulting features are fused and jointly classified. Our core innovations are cross-modal temporal alignment modeling and a dialogue-oriented lightweight visual representation. Evaluated on standard multi-party dialogue benchmarks, our method achieves 66.36% accuracy for emotion recognition and 72.15% for sentiment analysis, significantly outperforming all unimodal baselines and demonstrating the effectiveness of multimodal collaborative modeling.
📝 Abstract
Emotion recognition and sentiment analysis are pivotal tasks in speech and language processing, particularly in real-world scenarios involving multi-party conversational data. This paper presents a multimodal approach to these challenges on a well-known dataset. We propose a system that integrates four modalities using pre-trained models: RoBERTa for text, Wav2Vec2 for speech, a proposed FacialNet for facial expressions, and a CNN+Transformer architecture trained from scratch for video analysis. The feature embeddings from each modality are concatenated into a single multimodal vector, which is then used to predict emotion and sentiment labels. The multimodal system outperforms unimodal approaches, achieving an accuracy of 66.36% for emotion recognition and 72.15% for sentiment analysis.
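The late-fusion step described above, concatenating per-modality embeddings into one multimodal vector and feeding it to a classification head, can be sketched as follows. This is a minimal illustration, not the paper's implementation: the embedding dimensions, the number of emotion classes, and the single linear head are all assumptions, and random vectors stand in for the actual RoBERTa, Wav2Vec2, FacialNet, and CNN+Transformer encoder outputs.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-modality embedding sizes (not specified in the paper).
DIMS = {"text": 768, "speech": 768, "facial": 128, "video": 256}
NUM_CLASSES = 7  # assumption: a typical dialogue emotion label set

# Placeholder embeddings standing in for the four encoder outputs
# (RoBERTa, Wav2Vec2, FacialNet, CNN+Transformer video encoder).
embeddings = {name: rng.standard_normal(d) for name, d in DIMS.items()}

# Late fusion by concatenation into a single multimodal vector.
fused = np.concatenate(
    [embeddings[m] for m in ("text", "speech", "facial", "video")]
)
assert fused.shape == (sum(DIMS.values()),)  # 768 + 768 + 128 + 256 = 1920

# Illustrative linear classification head over the fused vector.
W = rng.standard_normal((NUM_CLASSES, fused.size)) * 0.01
b = np.zeros(NUM_CLASSES)
logits = W @ fused + b

# Softmax over the emotion classes, then argmax for the prediction.
probs = np.exp(logits - logits.max())
probs /= probs.sum()
predicted = int(np.argmax(probs))
print(fused.shape, predicted)
```

In practice each encoder would be trained or fine-tuned jointly with the fusion head, but the fusion operation itself is exactly this concatenation followed by classification.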