🤖 AI Summary
This work proposes Emotion Transcription in Conversation (ETC), a novel task that replaces conventional categorical or dimensional emotion labels with natural language descriptions, so that complex, subtle, and culturally specific emotional states of speakers in dialogues can be captured more accurately. To support the task, the authors construct a Japanese dataset of text-based dialogues paired with participants' self-reported emotion descriptions, along with an emotion category label for each description. They benchmark baseline models on the task. The results show that fine-tuning on the dataset improves performance, but that current models still struggle to infer implicit emotional states. The released dataset offers a benchmark for fine-grained, culturally aware emotion modeling in conversational contexts.
📝 Abstract
Emotion Recognition in Conversation (ERC) is critical for enabling natural human-machine interaction. However, existing methods predominantly rely on categorical or dimensional emotion annotations, which often fail to represent complex, subtle, or culturally specific emotional nuances. To overcome this limitation, we propose a novel task named Emotion Transcription in Conversation (ETC). The task focuses on generating natural language descriptions that accurately reflect speakers' emotional states within conversational contexts. To support ETC, we constructed a Japanese dataset of text-based dialogues annotated with participants' self-reported emotional states, described in natural language. The dataset also includes an emotion category label for each transcription, which enables quantitative analysis and allows the dataset to be used for ERC. We benchmarked baseline models and found that, while fine-tuning on our dataset improves model performance, current models still struggle to infer implicit emotional states. We expect the ETC task to encourage further research into more expressive emotion understanding in dialogue. The dataset is publicly available at https://github.com/UEC-InabaLab/ETCDataset.