🤖 AI Summary
Existing affective understanding benchmarks fall short in ecological validity, signal clarity, and the reliability of fine-grained annotations, which hinders the training and evaluation of empathetic models. This work proposes EmoS, a high-fidelity bilingual multimodal affective benchmark that integrates rigorously curated static clips with dynamic streaming monologues. To reconcile ecological validity with signal quality, EmoS introduces a dual-layer human annotation protocol and a streaming affect annotation framework. Multimodal large language models fine-tuned on EmoS significantly outperform zero-shot baselines, demonstrating the benchmark’s effectiveness in supporting fine-grained, continuous affect modeling. The dataset and code are publicly released.
📝 Abstract
In today's high-pressure, aging society, the demand for large-scale emotion models capable of providing empathetic support is more pressing than ever. However, existing benchmarks fail to achieve ecological validity, signal clarity, and reliable fine-grained labeling simultaneously. We introduce EmoS, a high-fidelity bilingual benchmark designed to address the limited ecological validity and the noise of existing datasets by combining strictly filtered static slices with a dynamic Streaming Monologue subset. Supported by a rigorous dual-layer human annotation pipeline, EmoS provides trusted ground truth that captures continuous emotional evolution. Empirical results show that fine-tuning multimodal large language models (MLLMs) on EmoS yields significant gains over zero-shot baselines, laying the foundation for training and evaluating future emotion recognition and empathy models. The dataset and code are publicly available at https://github.com/NLP2CT/EmoS.