🤖 AI Summary
This work addresses a limitation of existing speech emotion description systems, which typically model only a static, single emotion within an isolated utterance and fail to capture dynamic emotional shifts at the discourse level. To overcome this, the authors propose an emotion-transition-aware paradigm for speech description and introduce the first large-scale discourse-level emotion transition dataset. They design an automated pipeline that combines acoustic features with temporal cues and uses large language models to generate both descriptive and instruction-oriented annotations. Through multi-task learning, discourse-level temporal modeling, and emotional speech synthesis, the proposed approach targets fine-grained, temporally dynamic emotion understanding and expression, with the aim of making conversational agents more human-like in their emotional expressiveness.
📝 Abstract
Emotion perception and adaptive expression are fundamental capabilities in human-agent interaction. While recent advances in speech emotion captioning (SEC) have improved fine-grained emotional modeling, existing systems remain limited to static, single-emotion characterization within isolated sentences, neglecting dynamic emotional transitions at the discourse level. To address this gap, we propose Emotion Transition-Aware Speech Captioning (EmoTransCap), a paradigm that integrates temporal emotion dynamics with discourse-level speech description. To construct a dataset that is rich in emotion transitions and can be expanded at scale, we design an automated dataset creation pipeline. The resulting dataset is the first large-scale dataset explicitly designed to capture discourse-level emotion transitions. To generate semantically rich descriptions, we incorporate acoustic attributes and temporal cues from discourse-level speech. Our Multi-Task Emotion Transition Recognition (MTETR) model performs joint emotion transition detection and diarization. Leveraging the semantic analysis capabilities of LLMs, we produce two annotation versions: descriptive and instruction-oriented. These data and annotations offer a valuable resource for advancing emotion perception and emotional expressiveness. The dataset enables speech captions that capture emotional transitions, facilitating temporally dynamic, fine-grained emotion understanding. We also introduce a controllable, transition-aware emotional speech synthesis system at the discourse level, enhancing anthropomorphic emotional expressiveness and supporting emotionally intelligent conversational agents.
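As a rough illustration of what joint transition detection and diarization could yield, the sketch below represents a discourse as time-stamped emotion segments, derives the transition points between them, and renders a simple templated caption. The `EmotionSegment` class, the function names, and the caption template are hypothetical and are not taken from MTETR or the EmoTransCap pipeline; in particular, the abstract states that the actual annotations are produced with LLMs, which a fixed template does not reproduce.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class EmotionSegment:
    """Hypothetical diarization unit: one emotion label over a time span."""
    start: float  # seconds from the beginning of the discourse
    end: float    # seconds from the beginning of the discourse
    emotion: str


def merge_adjacent(segments: List[EmotionSegment]) -> List[EmotionSegment]:
    """Sort segments by start time and merge neighbours that share a label."""
    merged: List[EmotionSegment] = []
    for seg in sorted(segments, key=lambda s: s.start):
        if merged and merged[-1].emotion == seg.emotion:
            last = merged[-1]
            merged[-1] = EmotionSegment(last.start, max(last.end, seg.end), seg.emotion)
        else:
            merged.append(EmotionSegment(seg.start, seg.end, seg.emotion))
    return merged


def find_transitions(segments: List[EmotionSegment]) -> List[Tuple[float, str, str]]:
    """Return (time, from_emotion, to_emotion) for each change of label."""
    merged = merge_adjacent(segments)
    return [(prev.end, prev.emotion, nxt.emotion)
            for prev, nxt in zip(merged, merged[1:])]


def describe(segments: List[EmotionSegment]) -> str:
    """Render a templated, transition-aware caption (a stand-in for an LLM)."""
    merged = merge_adjacent(segments)
    if not merged:
        return "No speech segments."
    parts = [f"{s.emotion} from {s.start:.1f}s to {s.end:.1f}s" for s in merged]
    return "The speaker sounds " + ", then ".join(parts) + "."
```

Given segments labelled neutral, neutral, angry, and sad in sequence, `find_transitions` reports two transitions (neutral to angry, angry to sad), and `describe` strings the merged spans into a single sentence. A static, single-emotion caption would instead collapse the whole discourse into one label, which is the limitation the abstract describes.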