🤖 AI Summary
This study addresses a limitation of existing video-based emotion recognition methods, which typically focus on single emotions and struggle to model the blended emotions, and the relative salience of their components, commonly observed in real-world scenarios. To this end, we introduce BLEMORE, a multimodal dataset comprising over 3,000 video clips from 58 actors, covering six basic and ten blended emotions, along with the first large-scale, fine-grained annotations of emotion component salience (e.g., 50/50, 70/30). Leveraging this dataset, we jointly model emotion presence and salience, evaluating state-of-the-art multimodal architectures including ImageBind, WavLM, HiCMAE, VideoMAEv2, and HuBERT. Experimental results show that the best multimodal approaches achieve 35% and 33% presence accuracy on the validation and test sets, respectively, and 18% salience accuracy, establishing a foundation for future research on complex emotion recognition.
📝 Abstract
Humans often experience not just a single basic emotion at a time, but rather a blend of several emotions with varying salience. Despite the importance of such blended emotions, most video-based emotion recognition approaches are designed to recognize single emotions only. The few approaches that have attempted to recognize blended emotions typically cannot assess the relative salience of the emotions within a blend. This limitation largely stems from the lack of datasets containing a substantial number of blended emotion samples annotated with relative salience. To address this shortcoming, we introduce BLEMORE, a novel dataset for multimodal (video, audio) blended emotion recognition that includes information on the relative salience of each emotion within a blend. BLEMORE comprises over 3,000 clips from 58 actors, performing 6 basic emotions and 10 distinct blends, where each blend has 3 different salience configurations (50/50, 70/30, and 30/70). Using this dataset, we conduct extensive evaluations of state-of-the-art video classification approaches on two blended emotion prediction tasks: (1) predicting the presence of emotions in a given sample, and (2) predicting the relative salience of emotions in a blend. Our results show that unimodal classifiers achieve up to 29% presence accuracy and 13% salience accuracy on the validation set, while multimodal methods yield clear improvements, with ImageBind + WavLM reaching 35% presence accuracy and HiCMAE 18% salience accuracy. On the held-out test set, the best models achieve 33% presence accuracy (VideoMAEv2 + HuBERT) and 18% salience accuracy (HiCMAE). In sum, the BLEMORE dataset provides a valuable resource for advancing research on emotion recognition systems that account for the complexity and significance of blended emotion expressions.
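To make the two evaluation tasks concrete, here is a minimal sketch of how presence and salience accuracy could be scored for a blend with a 70/30 split. The `(emotion, salience)` pair encoding and the exact-match scoring rules are illustrative assumptions for this sketch, not BLEMORE's actual label format or official evaluation protocol.

```python
# Hypothetical scoring sketch for the two tasks in the abstract:
# (1) presence: was the correct set of emotions predicted?
# (2) salience: were the emotions AND their relative salience shares correct?
# The (emotion, salience) encoding is an assumption, not BLEMORE's format.

def presence_correct(pred, true):
    """Presence is correct if the predicted emotion set matches, ignoring salience."""
    return {e for e, _ in pred} == {e for e, _ in true}

def salience_correct(pred, true):
    """Salience is correct only if each emotion carries the right salience share."""
    return sorted(pred) == sorted(true)

# A 70/30 blend of anger and sadness, encoded as (emotion, salience%) pairs.
true_label = [("anger", 70), ("sadness", 30)]
pred_a = [("sadness", 30), ("anger", 70)]  # right emotions, right split
pred_b = [("anger", 30), ("sadness", 70)]  # right emotions, wrong split

assert presence_correct(pred_a, true_label) and salience_correct(pred_a, true_label)
assert presence_correct(pred_b, true_label) and not salience_correct(pred_b, true_label)
```

Under this exact-match view, salience accuracy is strictly harder than presence accuracy, which is consistent with the much lower salience scores (18%) the abstract reports relative to presence scores (33-35%).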