Semantic Matters: Multimodal Features for Affective Analysis

📅 2025-03-16
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses two real-world affective and behavioral analysis challenges posed by the 8th ABAW Competition: Emotional Mimicry Intensity (EMI) regression and Behavioural Ambivalence/Hesitancy (BAH) binary classification. We propose a multimodal fusion framework whose novelty lies in integrating VAD-derived speech representations from Wav2Vec 2.0 with BERT-based textual and ViT-based visual features, incorporating a cross-modal complementarity mechanism in which visual cues enhance textual semantic parsing. Experimental results demonstrate that the semantic modalities (text + vision) exhibit superior discriminative power over the acoustic one. On the test set, our method achieves a Pearson correlation coefficient of 0.706 on EMI (first place) and an F1-score of 0.702 on BAH (second place), validating the efficacy of semantic-driven modeling and synergistic multimodal integration.
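As a rough illustration of the feature-extraction stage described above, the sketch below pulls frame-level audio embeddings from a pretrained Wav2Vec 2.0 checkpoint and [CLS] embeddings from BERT and ViT via Hugging Face Transformers. The checkpoint names are generic placeholders, not the exact models used in the paper.

```python
# Hedged sketch of per-modality feature extraction with Hugging Face Transformers.
# Checkpoint names are placeholders, not the paper's exact pretrained models.
import torch
from transformers import (Wav2Vec2FeatureExtractor, Wav2Vec2Model,
                          BertTokenizer, BertModel,
                          ViTImageProcessor, ViTModel)

audio_fe = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base-960h")
audio_enc = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base-960h")
text_tok = BertTokenizer.from_pretrained("bert-base-uncased")
text_enc = BertModel.from_pretrained("bert-base-uncased")
img_proc = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224")
img_enc = ViTModel.from_pretrained("google/vit-base-patch16-224")

@torch.no_grad()
def extract_features(waveform_16khz, transcript, face_crops):
    # Audio: (1, T_audio, 768) frame-level representations from Wav2Vec 2.0.
    a_in = audio_fe(waveform_16khz, sampling_rate=16000, return_tensors="pt")
    audio_feats = audio_enc(a_in.input_values).last_hidden_state

    # Text: pooled [CLS] embedding of the transcript, shape (1, 768).
    t_in = text_tok(transcript, return_tensors="pt", truncation=True)
    text_feats = text_enc(**t_in).last_hidden_state[:, 0]

    # Vision: one [CLS] embedding per face crop, shape (N_frames, 768).
    v_in = img_proc(images=face_crops, return_tensors="pt")
    vision_feats = img_enc(**v_in).last_hidden_state[:, 0]
    return audio_feats, text_feats, vision_feats
```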

📝 Abstract
In this study, we present our methodology for two tasks: the Emotional Mimicry Intensity (EMI) Estimation Challenge and the Behavioural Ambivalence/Hesitancy (BAH) Recognition Challenge, both conducted as part of the 8th Workshop and Competition on Affective & Behavior Analysis in-the-wild. We utilize a Wav2Vec 2.0 model pre-trained on a large podcast dataset to extract various audio features, capturing both linguistic and paralinguistic information. Our approach incorporates a valence-arousal-dominance (VAD) module derived from Wav2Vec 2.0, a BERT text encoder, and a vision transformer (ViT) with predictions subsequently processed through a long short-term memory (LSTM) architecture or a convolution-like method for temporal modeling. We integrate the textual and visual modality into our analysis, recognizing that semantic content provides valuable contextual cues and underscoring that the meaning of speech often conveys more critical insights than its acoustic counterpart alone. Fusing in the vision modality helps in some cases to interpret the textual modality more precisely. This combined approach results in significant performance improvements, achieving in EMI $\rho_{\text{TEST}} = 0.706$ and in BAH $F1_{\text{TEST}} = 0.702$, securing first place in the EMI challenge and second place in the BAH challenge.
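For context, the two reported test metrics can be computed roughly as below. This is a minimal sketch, assuming the EMI score is the Pearson correlation averaged over the emotion dimensions and the BAH score is a binary F1; it is not the organizers' official scoring code.

```python
# Minimal sketch of the two challenge metrics; not the official evaluation scripts.
import numpy as np
from scipy.stats import pearsonr
from sklearn.metrics import f1_score

def emi_score(pred, target):
    """Pearson correlation averaged across emotion dimensions (EMI, assumed protocol)."""
    pred, target = np.asarray(pred), np.asarray(target)  # shape: (samples, dims)
    return np.mean([pearsonr(pred[:, d], target[:, d])[0] for d in range(pred.shape[1])])

def bah_score(pred_labels, target_labels):
    """F1-score for the binary ambivalence/hesitancy labels (BAH)."""
    return f1_score(target_labels, pred_labels)
```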
Problem

Research questions and friction points this paper is trying to address.

Estimating Emotional Mimicry Intensity using multimodal features
Recognizing Behavioural Ambivalence/Hesitancy through affective analysis
Improving performance by fusing audio, text, and visual modalities
Innovation

Methods, ideas, or system contributions that make the work stand out.

Wav2Vec 2.0 for audio feature extraction
BERT and ViT for multimodal encoding
LSTM-based temporal modeling for fusion (see the sketch below)
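Below is a minimal sketch of how an LSTM head could fuse the per-frame features for the two tasks; the layer sizes, the simple concatenation-based fusion, and the assumption that all modalities are temporally aligned are illustrative choices, not the paper's exact configuration.

```python
# Hedged sketch of an LSTM temporal fusion head; dimensions and the plain
# concatenation fusion are illustrative assumptions, not the paper's exact design.
import torch
import torch.nn as nn

class TemporalFusionHead(nn.Module):
    def __init__(self, audio_dim=768, text_dim=768, vision_dim=768,
                 hidden_dim=256, num_outputs=6):
        super().__init__()
        self.lstm = nn.LSTM(audio_dim + text_dim + vision_dim, hidden_dim,
                            batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden_dim, num_outputs)

    def forward(self, audio, text, vision):
        # audio/vision: (B, T, D), assumed resampled to a common frame rate;
        # text: (B, D), broadcast over the time axis.
        text = text.unsqueeze(1).expand(-1, audio.size(1), -1)
        fused = torch.cat([audio, text, vision], dim=-1)
        out, _ = self.lstm(fused)
        # Mean-pool over time, then predict the task outputs.
        return self.head(out.mean(dim=1))
```

For EMI, num_outputs would match the number of mimicked-emotion dimensions; for BAH, a single logit followed by a sigmoid would be used instead.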
Tobias Hallmen
Doctoral candidate, Research assistant, University of Augsburg
Analysis of Conversations, Language Models, Natural Language Processing, Multimodal Deep Learning
Robin-Nico Kampa
Institute for Distributed Intelligent Systems, University of the Bundeswehr Munich
Fabian Deuser
University of the Bundeswehr Munich
deep learning, multimodal deep learning, geo localisation
Norbert Oswald
Institute for Distributed Intelligent Systems, University of the Bundeswehr Munich
Elisabeth André
Chair for Human-Centered Artificial Intelligence, University of Augsburg