Multimodal Emotion Regression with Multi-Objective Optimization and VAD-Aware Audio Modeling for the 10th ABAW EMI Track

📅 2026-03-14
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses intensity regression for six continuous emotional dimensions—admiration, amusement, determination, empathic pain, excitement, and joy—on the Hume-Vidmimic2 dataset. The proposed approach combines a simple concatenation of multimodal pretrained features with a shared six-dimensional regression head, joint optimization of mean squared error and Pearson correlation coefficient, an auxiliary supervision branch, exponential moving average (EMA) parameter smoothing, and an acoustic latent prior inspired by the Valence-Arousal-Dominance (VAD) framework. The study finds that feature-level concatenation outperforms the more complex fusion strategies tested and advocates three design principles: preserving modality-specific characteristics, aligning multi-objective optimization with the evaluation metric, and incorporating VAD-aware audio representations. The method achieves a mean Pearson correlation coefficient of 0.4786 on the official validation set.
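The summary's joint optimization of MSE and Pearson correlation can be sketched as a single combined loss. This is a minimal numpy illustration, not the authors' implementation: the `alpha` weighting and the epsilon term are assumptions, and the paper does not specify the exact weighting here.

```python
import numpy as np

def joint_loss(pred, target, alpha=0.5):
    """Hypothetical joint objective: weighted sum of MSE and (1 - Pearson r),
    with the correlation computed per emotion dimension (columns) and
    averaged. `alpha` is an illustrative weight, not taken from the paper."""
    pred = np.asarray(pred, dtype=float)
    target = np.asarray(target, dtype=float)
    mse = np.mean((pred - target) ** 2)
    # center each dimension, then compute Pearson r per column
    pc = pred - pred.mean(axis=0)
    tc = target - target.mean(axis=0)
    r = (pc * tc).sum(axis=0) / (
        np.sqrt((pc ** 2).sum(axis=0)) * np.sqrt((tc ** 2).sum(axis=0)) + 1e-8
    )
    # perfect correlation and zero error drive the loss toward 0
    return alpha * mse + (1 - alpha) * (1.0 - r.mean())
```

Optimizing `1 - r` directly aligns training with the challenge's mean-Pearson evaluation metric, which is the metric-alignment principle the summary describes.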

📝 Abstract
We participated in the 10th ABAW Challenge, focusing on the Emotional Mimicry Intensity (EMI) Estimation track on the Hume-Vidmimic2 dataset. This task aims to predict six continuous emotion dimensions: Admiration, Amusement, Determination, Empathic Pain, Excitement, and Joy. Through systematic multimodal exploration of pretrained high-level features, we found that, under our pretrained feature setting, direct feature concatenation outperformed the more complex fusion strategies we tested. This empirical finding motivated us to design a systematic approach built upon three core principles: (i) preserving modality-specific attributes through feature-level concatenation; (ii) improving training stability and metric alignment via multi-objective optimization; and (iii) enriching acoustic representations with a VAD-inspired latent prior. Our final framework integrates concatenation-based multimodal fusion, a shared six-dimensional regression head, multi-objective optimization with MSE, Pearson-correlation, and auxiliary branch supervision, EMA for parameter stabilization, and a VAD-inspired latent prior for the acoustic branch. On the official validation set, the proposed scheme achieved our best mean Pearson Correlation Coefficient of 0.478567.
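The abstract mentions EMA for parameter stabilization. A minimal sketch of exponential moving average parameter smoothing follows; the decay value is a common default, not taken from the paper, and the dict-of-arrays parameter format is an illustrative simplification.

```python
import numpy as np

class EMA:
    """Minimal sketch of exponential moving average (EMA) parameter
    smoothing for training stabilization. Keeps a 'shadow' copy of the
    parameters that trails the live weights; the shadow weights are
    typically used at evaluation time."""

    def __init__(self, params, decay=0.999):
        self.decay = decay
        # initialize the shadow copy from the current parameters
        self.shadow = {k: v.copy() for k, v in params.items()}

    def update(self, params):
        # shadow <- decay * shadow + (1 - decay) * current
        for k, v in params.items():
            self.shadow[k] = self.decay * self.shadow[k] + (1 - self.decay) * v
```

Calling `update` after each optimizer step damps step-to-step noise in the weights, which is the stabilization effect the abstract refers to.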
Problem

Research questions and friction points this paper is trying to address.

Emotional Mimicry Intensity
Multimodal Emotion Regression
Continuous Emotion Dimensions
ABAW Challenge
Hume-Vidmimic2
Innovation

Methods, ideas, or system contributions that make the work stand out.

multimodal fusion
multi-objective optimization
VAD-inspired latent prior
emotion regression
feature concatenation
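The concatenation-based fusion and shared six-dimensional regression head listed above can be sketched in a few lines. The feature dimensions and the single linear layer are illustrative assumptions; the paper's actual extractors and head architecture are not specified here.

```python
import numpy as np

rng = np.random.default_rng(0)

# hypothetical per-sample pretrained features (sizes are illustrative,
# not the paper's actual extractor dimensions)
audio_feat = rng.standard_normal(256)
visual_feat = rng.standard_normal(512)
text_feat = rng.standard_normal(128)

# (i) feature-level concatenation preserves modality-specific attributes:
# each modality keeps its own subspace instead of being projected away
fused = np.concatenate([audio_feat, visual_feat, text_feat])  # shape (896,)

# shared six-dimensional regression head, sketched as one linear layer
# producing an intensity score per emotion dimension
W = rng.standard_normal((6, fused.shape[0])) * 0.01
b = np.zeros(6)
pred = W @ fused + b
```

A single head shared across the six dimensions lets correlated emotions (e.g. amusement and joy) share representation, while concatenation avoids the information loss the authors observed with more complex fusion.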
👥 Authors
Jiawen Huang
School of Artificial Intelligence, University of Chinese Academy of Sciences; Institute of Automation, Chinese Academy of Sciences
Chenxi Huang
School of Artificial Intelligence, University of Chinese Academy of Sciences; Institute of Automation, Chinese Academy of Sciences
Zhuofan Wen
School of Artificial Intelligence, University of Chinese Academy of Sciences; Institute of Automation, Chinese Academy of Sciences
Hailiang Yao
Tianjin Normal University; Institute of Automation, Chinese Academy of Sciences
Shun Chen
Institute of Automation, Chinese Academy of Sciences
Affective Computing, Human-Computer Interaction, Deep Learning
Longjiang Yang
School of Artificial Intelligence, University of Chinese Academy of Sciences; Institute of Automation, Chinese Academy of Sciences
Cong Yu
Head of Engineering, Dandy
ML / Language Model · ML / Computer Vision · 3D/CAD · Process Mining · Data Mining
Fengyu Zhang
School of Artificial Intelligence, University of Chinese Academy of Sciences; Institute of Automation, Chinese Academy of Sciences
Ran Liu
Tianjin Normal University; Institute of Automation, Chinese Academy of Sciences
Bin Liu
Institute of Automation, Chinese Academy of Sciences
Pattern Recognition · Affective Computing · Speech Processing · Human-Machine Interaction