Semantic Matters: Multimodal Features for Affective Analysis

📅 2025-03-16
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses two real-world affective and behavioral analysis challenges posed by the 8th ABAW Competition: Emotional Mimicry Intensity (EMI) regression and Behavioural Ambivalence/Hesitancy (BAH) binary classification. We propose a multimodal fusion framework whose novelty lies in integrating VAD-derived speech representations from Wav2Vec 2.0 with BERT-based textual and ViT-based visual features, incorporating a cross-modal complementarity mechanism in which visual cues enhance textual semantic parsing. Experimental results demonstrate that the semantic modalities (text + vision) exhibit superior discriminative power over the acoustic one. On the test set, our method achieves a Pearson correlation coefficient of 0.706 on EMI (first place) and an F1-score of 0.702 on BAH (second place), validating the efficacy of semantic-driven modeling and synergistic multimodal integration.
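As a rough illustration of the feature-extraction stage described above, the sketch below pulls frame-level audio embeddings from a pretrained Wav2Vec 2.0 checkpoint and [CLS] embeddings from BERT and ViT via Hugging Face Transformers. The checkpoint names are generic placeholders, not the exact models used in the paper.

```python
# Hedged sketch of per-modality feature extraction with Hugging Face Transformers.
# Checkpoint names are placeholders, not the paper's exact pretrained models.
import torch
from transformers import (Wav2Vec2FeatureExtractor, Wav2Vec2Model,
                          BertTokenizer, BertModel,
                          ViTImageProcessor, ViTModel)

audio_fe = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base-960h")
audio_enc = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base-960h")
text_tok = BertTokenizer.from_pretrained("bert-base-uncased")
text_enc = BertModel.from_pretrained("bert-base-uncased")
img_proc = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224")
img_enc = ViTModel.from_pretrained("google/vit-base-patch16-224")

@torch.no_grad()
def extract_features(waveform_16khz, transcript, face_crops):
    # Audio: (1, T_audio, 768) frame-level representations from Wav2Vec 2.0.
    a_in = audio_fe(waveform_16khz, sampling_rate=16000, return_tensors="pt")
    audio_feats = audio_enc(a_in.input_values).last_hidden_state

    # Text: pooled [CLS] embedding of the transcript, shape (1, 768).
    t_in = text_tok(transcript, return_tensors="pt", truncation=True)
    text_feats = text_enc(**t_in).last_hidden_state[:, 0]

    # Vision: one [CLS] embedding per face crop, shape (N_frames, 768).
    v_in = img_proc(images=face_crops, return_tensors="pt")
    vision_feats = img_enc(**v_in).last_hidden_state[:, 0]
    return audio_feats, text_feats, vision_feats
```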

📝 Abstract
In this study, we present our methodology for two tasks: the Emotional Mimicry Intensity (EMI) Estimation Challenge and the Behavioural Ambivalence/Hesitancy (BAH) Recognition Challenge, both conducted as part of the 8th Workshop and Competition on Affective & Behavior Analysis in-the-wild. We utilize a Wav2Vec 2.0 model pre-trained on a large podcast dataset to extract various audio features, capturing both linguistic and paralinguistic information. Our approach incorporates a valence-arousal-dominance (VAD) module derived from Wav2Vec 2.0, a BERT text encoder, and a vision transformer (ViT) with predictions subsequently processed through a long short-term memory (LSTM) architecture or a convolution-like method for temporal modeling. We integrate the textual and visual modality into our analysis, recognizing that semantic content provides valuable contextual cues and underscoring that the meaning of speech often conveys more critical insights than its acoustic counterpart alone. Fusing in the vision modality helps in some cases to interpret the textual modality more precisely. This combined approach results in significant performance improvements, achieving in EMI $\rho_{\text{TEST}} = 0.706$ and in BAH $F1_{\text{TEST}} = 0.702$, securing first place in the EMI challenge and second place in the BAH challenge.
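For context, the two reported test metrics can be computed roughly as below. This is a minimal sketch, assuming the EMI score is the Pearson correlation averaged over the emotion dimensions and the BAH score is a binary F1; it is not the organizers' official scoring code.

```python
# Minimal sketch of the two challenge metrics; not the official evaluation scripts.
import numpy as np
from scipy.stats import pearsonr
from sklearn.metrics import f1_score

def emi_score(pred, target):
    """Pearson correlation averaged across emotion dimensions (EMI, assumed protocol)."""
    pred, target = np.asarray(pred), np.asarray(target)  # shape: (samples, dims)
    return np.mean([pearsonr(pred[:, d], target[:, d])[0] for d in range(pred.shape[1])])

def bah_score(pred_labels, target_labels):
    """F1-score for the binary ambivalence/hesitancy labels (BAH)."""
    return f1_score(target_labels, pred_labels)
```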
Problem

Research questions and friction points this paper is trying to address.

Estimating Emotional Mimicry Intensity using multimodal features
Recognizing Behavioural Ambivalence/Hesitancy through affective analysis
Improving performance by fusing audio, text, and visual modalities
Innovation

Methods, ideas, or system contributions that make the work stand out.

Wav2Vec 2.0 for audio feature extraction
BERT and ViT for multimodal encoding
LSTM-based temporal modeling for fusion (see the sketch below)
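Below is a minimal sketch of how an LSTM head could fuse the per-frame features for the two tasks; the layer sizes, the simple concatenation-based fusion, and the assumption that all modalities are temporally aligned are illustrative choices, not the paper's exact configuration.

```python
# Hedged sketch of an LSTM temporal fusion head; dimensions and the plain
# concatenation fusion are illustrative assumptions, not the paper's exact design.
import torch
import torch.nn as nn

class TemporalFusionHead(nn.Module):
    def __init__(self, audio_dim=768, text_dim=768, vision_dim=768,
                 hidden_dim=256, num_outputs=6):
        super().__init__()
        self.lstm = nn.LSTM(audio_dim + text_dim + vision_dim, hidden_dim,
                            batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden_dim, num_outputs)

    def forward(self, audio, text, vision):
        # audio/vision: (B, T, D), assumed resampled to a common frame rate;
        # text: (B, D), broadcast over the time axis.
        text = text.unsqueeze(1).expand(-1, audio.size(1), -1)
        fused = torch.cat([audio, text, vision], dim=-1)
        out, _ = self.lstm(fused)
        # Mean-pool over time, then predict the task outputs.
        return self.head(out.mean(dim=1))
```

For EMI, num_outputs would match the number of mimicked-emotion dimensions; for BAH, a single logit followed by a sigmoid would be used instead.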
Tobias Hallmen
Doctoral candidate, Research assistant, University of Augsburg
Analysis of Conversations, Language Models, Natural Language Processing, Multimodal Deep Learning
Robin-Nico Kampa
Institute for Distributed Intelligent Systems, University of the Bundeswehr Munich
Fabian Deuser
University of the Bundeswehr Munich
deep learning, multimodal deep learning, geo localisation
Norbert Oswald
Institute for Distributed Intelligent Systems, University of the Bundeswehr Munich
Elisabeth André
Chair for Human-Centered Artificial Intelligence, University of Augsburg