🤖 AI Summary
Existing descriptive multimodal emotion recognition (DMER) evaluation either relies on costly human-annotated natural-language descriptions or degenerates into coarse-grained classification, losing critical affective dimensions such as temporal dynamics, intensity, and uncertainty. To address this, we propose DMER-Ranker, a ground-truth-free evaluation paradigm based on pairwise ranking. Our approach makes three key contributions: (1) reformulating the "prediction–ground truth" comparison as a "prediction–prediction" comparison, eliminating the need for reference descriptions; (2) applying the Bradley–Terry algorithm to convert pairwise comparison results into model-level rankings; and (3) DMER-Preference, the first preference dataset specifically designed for human emotions, which enables automatic preference prediction. By removing the dependence on reference descriptions, our framework improves evaluation efficiency and scalability while preserving fine-grained affective semantics, establishing an interpretable foundation for advanced emotion understanding and human–machine interaction.
📝 Abstract
Descriptive Multimodal Emotion Recognition (DMER) is a newly proposed task that aims to describe a person's emotional state using free-form natural language. Unlike traditional discriminative methods that rely on predefined emotion taxonomies, DMER provides greater flexibility in emotional expression, enabling fine-grained and interpretable emotion representations. However, this free-form prediction paradigm introduces significant challenges in evaluation. Existing methods either depend on ground-truth descriptions that require substantial manual effort or simplify the task by shifting the focus from evaluating descriptions to evaluating emotion labels. The former suffers from the labor-intensive collection of comprehensive descriptions, while the latter overlooks critical aspects such as emotional temporal dynamics, intensity, and uncertainty. To address these limitations, we propose DMER-Ranker, a novel evaluation strategy that reformulates the traditional "prediction–ground truth" comparison into a "prediction–prediction" comparison, eliminating the need for ground-truth descriptions. We then employ the Bradley–Terry algorithm to convert pairwise comparison results into model-level rankings. Additionally, we explore the possibility of automatic preference prediction and introduce DMER-Preference, the first preference dataset specifically designed for human emotions. Our work advances the field of DMER and lays the foundation for more intelligent human-computer interaction systems.
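The Bradley–Terry step above can be sketched in a few lines: given pairwise preference counts between models, the standard MM (minorization-maximization) update iteratively estimates a latent strength for each model, and sorting by strength yields the model-level ranking. The win counts and model names below are hypothetical illustrations, not data from the paper:

```python
def bradley_terry(wins, iters=200, tol=1e-9):
    """Fit Bradley-Terry strengths from a pairwise win matrix.

    wins[i][j] = number of times model i was preferred over model j.
    Returns normalized strengths (higher = more preferred), using the
    classic MM update: p_i = W_i / sum_j (n_ij / (p_i + p_j)).
    """
    n = len(wins)
    p = [1.0] * n
    for _ in range(iters):
        new_p = []
        for i in range(n):
            total_wins = sum(wins[i][j] for j in range(n) if j != i)
            denom = sum((wins[i][j] + wins[j][i]) / (p[i] + p[j])
                        for j in range(n) if j != i)
            new_p.append(total_wins / denom if denom > 0 else p[i])
        s = sum(new_p)
        new_p = [x / s for x in new_p]
        if max(abs(a - b) for a, b in zip(p, new_p)) < tol:
            return new_p
        p = new_p
    return p

# Hypothetical pairwise preference counts among three DMER models A, B, C
wins = [
    [0, 8, 9],   # A preferred over B 8 times, over C 9 times
    [2, 0, 6],   # B preferred over A 2 times, over C 6 times
    [1, 4, 0],   # C preferred over A 1 time,  over B 4 times
]
strengths = bradley_terry(wins)
ranking = sorted(range(len(wins)), key=lambda i: -strengths[i])  # [0, 1, 2]: A > B > C
```

In practice the pairwise outcomes would come from human (or automatic) preference judgments over pairs of emotion descriptions; the ranking then follows without any reference descriptions.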