EmoTransCap: Dataset and Pipeline for Emotion Transition-Aware Speech Captioning in Discourses

📅 2026-04-29
📈 Citations: 0
Influential: 0

🤖 AI Summary
Existing speech emotion captioning (SEC) systems typically model a single, static emotion within an isolated utterance and miss emotional shifts that unfold across a discourse. This work proposes an emotion transition-aware captioning paradigm, EmoTransCap, and introduces what the authors describe as the first large-scale dataset of discourse-level emotion transitions. An automated pipeline combines acoustic attributes with temporal cues, and large language models then generate two annotation versions: descriptive and instruction-oriented. A Multi-Task Emotion Transition Recognition (MTETR) model jointly performs emotion transition detection and diarization, and a controllable, transition-aware speech synthesis system extends the approach to expression, with the aim of making conversational agents' emotional expressiveness more human-like.
📝 Abstract
Emotion perception and adaptive expression are fundamental capabilities in human-agent interaction. While recent advances in speech emotion captioning (SEC) have improved fine-grained emotional modeling, existing systems remain limited to static, single-emotion characterization within isolated sentences, neglecting dynamic emotional transitions at the discourse level. To address this gap, we propose Emotion Transition-Aware Speech Captioning (EmoTransCap), a paradigm that integrates temporal emotion dynamics with discourse-level speech description. To construct a dataset rich in emotion transitions while enabling scalable expansion, we design an automated pipeline for dataset creation. This is the first large-scale dataset explicitly designed to capture discourse-level emotion transitions. To generate semantically rich descriptions, we incorporate acoustic attributes and temporal cues from discourse-level speech. Our Multi-Task Emotion Transition Recognition (MTETR) model performs joint emotion transition detection and diarization. Leveraging the semantic analysis capabilities of LLMs, we produce two annotation versions: descriptive and instruction-oriented. These data and annotations offer a valuable resource for advancing emotion perception and emotional expressiveness. The dataset enables speech captions that capture emotional transitions, facilitating temporal-dynamic and fine-grained emotion understanding. We also introduce a controllable, transition-aware emotional speech synthesis system at the discourse level, enhancing anthropomorphic emotional expressiveness and supporting emotionally intelligent conversational agents.
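To make the idea of emotion diarization and transition detection concrete, here is a minimal sketch of how time-stamped emotion segments could be turned into a list of transitions and a plain-text timeline. It is not the MTETR model or the authors' pipeline: the `EmotionSegment` type, the function names, and the example labels and timestamps are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class EmotionSegment:
    """Hypothetical diarization output: one emotion label over a time span."""
    start: float  # seconds
    end: float    # seconds
    emotion: str


def merge_adjacent(segments: List[EmotionSegment]) -> List[EmotionSegment]:
    """Sort by start time and merge consecutive segments that share a label."""
    merged: List[EmotionSegment] = []
    for seg in sorted(segments, key=lambda s: s.start):
        if merged and merged[-1].emotion == seg.emotion:
            prev = merged[-1]
            merged[-1] = EmotionSegment(prev.start, max(prev.end, seg.end), prev.emotion)
        else:
            merged.append(seg)
    return merged


def find_transitions(segments: List[EmotionSegment]) -> List[Tuple[float, str, str]]:
    """Return (time, from_emotion, to_emotion) for every change of label."""
    merged = merge_adjacent(segments)
    return [(b.start, a.emotion, b.emotion) for a, b in zip(merged, merged[1:])]


def describe(segments: List[EmotionSegment]) -> str:
    """Render a plain-text timeline that a captioning step could take as input."""
    merged = merge_adjacent(segments)
    return "; then ".join(
        f"{s.emotion} from {s.start:.1f}s to {s.end:.1f}s" for s in merged
    )


# Illustrative data only; not taken from the dataset.
example = [
    EmotionSegment(0.0, 2.5, "neutral"),
    EmotionSegment(2.5, 4.0, "neutral"),
    EmotionSegment(4.0, 7.2, "angry"),
]
transitions = find_transitions(example)
timeline = describe(example)
```

In this sketch, a transition is simply a boundary between two merged segments with different labels. How the actual system represents boundaries, overlapping emotions, or gradual shifts is not specified in the abstract.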
Problem

Research questions and friction points this paper is trying to address.

emotion transition
speech captioning
discourse-level emotion
dynamic emotion modeling
human-agent interaction
Innovation

Methods, ideas, or system contributions that make the work stand out.

emotion transition
speech captioning
discourse-level modeling
multi-task learning
controllable emotional synthesis