🤖 AI Summary
This work addresses blended emotion recognition, where emotions are expressed as mixtures of subtle, overlapping multimodal cues rather than a single dominant signal. The authors propose a rank-aware multi-encoder framework that projects features from pre-extracted video and audio encoders into a shared latent space, uses an attention-based gating module to estimate the importance of each encoder for every sample, and fuses only the top-n most informative encoders. Emotion prediction is decoupled into presence and salience heads, which are aligned through probability-level fusion, and feature-level unsupervised domain adaptation without pseudo-labeling is added to improve robustness under distribution shift. On the BlEmoRE challenge, the framework outperforms strong individual encoders and naïve multi-encoder fusion baselines, and the final system ranked 2nd in the competition.
📝 Abstract
Blended emotion recognition is challenging because emotions are often expressed as mixtures of subtle and overlapping multimodal cues rather than a single dominant signal. We propose a rank-aware multi-encoder framework that selectively combines complementary representations from diverse pre-extracted video and audio encoders. Our method projects heterogeneous encoder features into a shared latent space, estimates sample-wise encoder importance through an attention-based gating module, and fuses only the top-n most informative encoders. To better model blended emotions, we decouple prediction into presence and salience heads and align them through probability-level fusion. We further incorporate feature-level unsupervised domain adaptation without pseudo-labeling to improve robustness under distribution shift. Experiments on the BlEmoRE challenge show that the proposed framework outperforms strong individual encoders and naïve multi-encoder fusion baselines. Our final system ranked 2nd in the competition, supporting the effectiveness of rank-aware selective fusion for fine-grained blended emotion recognition.
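The selective fusion step described in the abstract (project, score, keep the top-n encoders, fuse) can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the abstract does not give the gating formula, so the dot-product gate (`gate_vector`), the softmax renormalisation over the kept encoders, and all names and toy numbers below are assumptions made only for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def project(features, weights):
    """Linear projection; `weights` is a list of rows (latent_dim x input_dim)."""
    return [sum(w * x for w, x in zip(row, features)) for row in weights]


def top_n_gated_fusion(encoder_features, projections, gate_vector, n):
    """Fuse only the n highest-scoring encoders for a single sample.

    Assumed (hypothetical) design: each encoder gets one scalar score from a
    dot product between its projected features and a shared gate vector.
    """
    # 1) Map heterogeneous encoder outputs into one shared latent space.
    latents = [project(f, w) for f, w in zip(encoder_features, projections)]
    # 2) Score each encoder for this sample.
    scores = [sum(g * z for g, z in zip(gate_vector, latent)) for latent in latents]
    # 3) Keep the indices of the n best-scoring encoders.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    kept = sorted(ranked[:n])
    # 4) Renormalise the kept scores and take a weighted sum of their latents.
    weights = softmax([scores[i] for i in kept])
    dim = len(gate_vector)
    fused = [sum(w * latents[i][d] for w, i in zip(weights, kept)) for d in range(dim)]
    return fused, kept, weights


# Toy sample: three encoders with different output sizes, latent size 2.
features = [[1.0, 0.0, 2.0], [0.5, 0.5], [3.0, 1.0, 0.0, 1.0]]
projections = [
    [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]],
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 0.0, 1.0]],
]
gate_vector = [1.0, 1.0]

fused, kept, weights = top_n_gated_fusion(features, projections, gate_vector, n=2)
print(kept, weights, fused)
```

In this toy sample the second encoder scores lowest and is dropped, so only the first and third contribute to the fused vector. In a trained system the projections and the gate would be learned, and the hard top-n selection would need a formulation that still lets gradients reach the gate; the abstract does not say how the authors handle that.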