EmoTransCap: Dataset and Pipeline for Emotion Transition-Aware Speech Captioning in Discourses

📅 2026-04-29

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing speech emotion captioning systems typically model a single, static emotion within an isolated utterance and fail to capture emotional shifts at the discourse level. This work proposes an emotion-transition-aware paradigm for speech captioning and introduces what the authors describe as the first large-scale dataset of discourse-level emotion transitions. An automated pipeline combines acoustic attributes with temporal cues and uses large language models to generate two annotation versions, descriptive and instruction-oriented. A multi-task model (MTETR) jointly performs emotion transition detection and diarization, and a controllable, transition-aware speech synthesis system is also introduced. Together, these components target fine-grained, temporally dynamic emotion understanding and more human-like emotional expression in conversational agents.

📝 Abstract

Emotion perception and adaptive expression are fundamental capabilities in human-agent interaction. While recent advances in speech emotion captioning (SEC) have improved fine-grained emotional modeling, existing systems remain limited to static, single-emotion characterization within isolated sentences, neglecting dynamic emotional transitions at the discourse level. To address this gap, we propose Emotion Transition-Aware Speech Captioning (EmoTransCap), a paradigm that integrates temporal emotion dynamics with discourse-level speech description. To construct a dataset rich in emotion transitions while enabling scalable expansion, we design an automated pipeline for dataset creation. The resulting dataset is the first large-scale resource explicitly designed to capture discourse-level emotion transitions. To generate semantically rich descriptions, we incorporate acoustic attributes and temporal cues from discourse-level speech. Our Multi-Task Emotion Transition Recognition (MTETR) model performs joint emotion transition detection and diarization. Leveraging the semantic analysis capabilities of LLMs, we produce two annotation versions: descriptive and instruction-oriented. These data and annotations offer a valuable resource for advancing emotion perception and emotional expressiveness. The dataset enables speech captions that capture emotional transitions, facilitating temporal-dynamic and fine-grained emotion understanding. We also introduce a controllable, transition-aware emotional speech synthesis system at the discourse level, enhancing anthropomorphic emotional expressiveness and supporting emotionally intelligent conversational agents.
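
To make "joint emotion transition detection and diarization" concrete, the sketch below shows one way the two outputs relate: diarization yields timed emotion segments, and transitions are the boundaries between adjacent segments. This is not the MTETR model, whose architecture the abstract does not describe. It is only an illustrative post-processing step that assumes some upstream classifier has already produced one emotion label per fixed-length frame. The function names `diarize` and `transitions`, the 0.5-second frame length, and the label strings are all hypothetical.

```python
from itertools import groupby


def diarize(frame_labels, frame_sec=0.5):
    """Merge runs of identical frame labels into (start, end, emotion) segments.

    frame_labels is assumed to hold one emotion label per fixed-length frame;
    frame_sec is an assumed frame duration in seconds.
    """
    segments = []
    index = 0
    for emotion, run in groupby(frame_labels):
        length = len(list(run))
        start = index * frame_sec
        end = (index + length) * frame_sec
        segments.append((start, end, emotion))
        index += length
    return segments


def transitions(segments):
    """Return (time, from_emotion, to_emotion) for each segment boundary."""
    return [
        (current[0], previous[2], current[2])
        for previous, current in zip(segments, segments[1:])
    ]


# Hypothetical frame-level predictions for a short discourse.
labels = ["neutral", "neutral", "happy", "happy", "happy", "sad"]
segments = diarize(labels)
events = transitions(segments)

for time, source, target in events:
    print(f"{time:.1f}s: {source} -> {target}")
```

Segments and transition events of this kind are the sort of temporal cues that could be passed, together with acoustic attributes, to an LLM when generating descriptive or instruction-oriented captions.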

Problem

Research questions and friction points this paper is trying to address.

emotion transition

speech captioning

discourse-level emotion

dynamic emotion modeling

human-agent interaction

Innovation

Methods, ideas, or system contributions that make the work stand out.

emotion transition

speech captioning

discourse-level modeling