Bridging Speech Emotion Recognition and Personality: Dataset and Temporal Interaction Condition Network

📅 2025-05-20
🤖 AI Summary
This work addresses the performance limitation in speech emotion recognition (SER) caused by neglecting speaker-specific personality differences. We systematically investigate the statistical associations between personality traits and vocal emotional expression, and construct the first personality-annotated IEMOCAP dataset. To model dynamic personality–emotion interactions, we propose a Temporal Interaction Condition Network (TICN), which fuses HuBERT acoustic features with personality embeddings via a temporal conditional attention mechanism. Furthermore, we design an end-to-end personality-aware SER framework with an automatic personality recognition (APR) front-end module, enabling emotion recognition without prior personality knowledge. Experiments show that incorporating ground-truth personality labels raises the concordance correlation coefficient (CCC) for valence prediction from 0.698 to 0.785 (about +12.5% relative to the baseline); using APR-predicted personality yields CCC = 0.776 (+11.2%), which validates the effectiveness and practicality of personality-aware modeling.

📝 Abstract
This study investigates the interaction between personality traits and emotional expression, exploring how personality information can improve speech emotion recognition (SER). We collected personality annotations for the IEMOCAP dataset, and statistical analysis identified significant correlations between personality traits and emotional expressions. To extract fine-grained personality features, we propose a temporal interaction condition network (TICN), in which personality features are integrated with HuBERT-based acoustic features for SER. Experiments show that incorporating ground-truth personality traits significantly enhances valence recognition, improving the concordance correlation coefficient (CCC) from 0.698 to 0.785 compared to the baseline without personality information. For practical applications in dialogue systems where personality information about the user is unavailable, we develop a front-end module for automatic personality recognition. Using these automatically predicted traits as inputs to our proposed TICN model, we achieve a CCC of 0.776 for valence recognition, representing an 11.17% relative improvement over the baseline. These findings confirm the effectiveness of personality-aware SER and provide a solid foundation for further exploration in personality-aware speech processing applications.
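
For readers unfamiliar with the metric, the following is a minimal sketch of how a concordance correlation coefficient and a relative improvement can be computed. It uses the standard CCC definition with population variances; the function names `ccc` and `relative_improvement` are illustrative and are not taken from the paper, and the paper's own evaluation code may differ in detail.

```python
from statistics import fmean


def ccc(pred, gold):
    """Concordance correlation coefficient of two equal-length sequences.

    CCC = 2 * cov(pred, gold) / (var(pred) + var(gold) + (mean_pred - mean_gold)^2)
    Population (divide-by-n) statistics are used here; this is an assumption.
    """
    if len(pred) != len(gold) or len(pred) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_p, mean_g = fmean(pred), fmean(gold)
    var_p = fmean([(p - mean_p) ** 2 for p in pred])
    var_g = fmean([(g - mean_g) ** 2 for g in gold])
    cov = fmean([(p - mean_p) * (g - mean_g) for p, g in zip(pred, gold)])
    denom = var_p + var_g + (mean_p - mean_g) ** 2
    if denom == 0:
        raise ValueError("CCC is undefined for identical constant sequences")
    return 2 * cov / denom


def relative_improvement(new, base):
    """Relative gain of `new` over `base`, in percent."""
    return (new - base) / base * 100


# Relative gains implied by the CCC values reported in the abstract.
gain_ground_truth = relative_improvement(0.785, 0.698)
gain_predicted = relative_improvement(0.776, 0.698)
print(round(gain_ground_truth, 2), round(gain_predicted, 2))
```

Unlike Pearson correlation, CCC also penalizes a constant offset between predictions and labels: in this sketch, `ccc([1, 2, 3], [2, 3, 4])` is 4/7 rather than 1, even though the two sequences are perfectly correlated.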
Problem

Research questions and friction points this paper is trying to address.

Investigates personality-emotion interaction in speech emotion recognition
Proposes TICN model integrating personality with acoustic features
Enhances valence recognition using personality-aware SER approach
Innovation

Methods, ideas, or system contributions that make the work stand out.

Temporal interaction condition network for SER
HuBERT-based acoustic features integration
Automatic personality recognition module
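
To make the idea of conditioning acoustic features on personality more concrete, here is a rough sketch of generic scaled dot-product cross-attention, in which each acoustic frame attends over a small set of personality vectors and the attended context is added back to the frame. This is not the TICN architecture, which this page does not specify: it omits learned projections, and the function `condition_frames` and the toy dimensions are assumptions made purely for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def condition_frames(frames, traits):
    """Condition each acoustic frame on personality vectors.

    frames: T vectors of size d (stand-ins for frame-level acoustic features)
    traits: K vectors of size d (hypothetical personality embeddings)
    Returns T vectors of size d: frame + attention-weighted trait context.
    """
    d = len(frames[0])
    scale = math.sqrt(d)
    conditioned = []
    for frame in frames:
        # Each frame acts as a query; trait vectors act as keys and values.
        weights = softmax([dot(frame, t) / scale for t in traits])
        context = [sum(w * t[i] for w, t in zip(weights, traits)) for i in range(d)]
        # Residual connection keeps the original acoustic information.
        conditioned.append([f + c for f, c in zip(frame, context)])
    return conditioned


# Toy example: two 2-dimensional frames, two trait vectors.
frames = [[1.0, 0.0], [0.0, 1.0]]
traits = [[2.0, 0.0], [0.0, 2.0]]
out = condition_frames(frames, traits)
```

Because the attention weights are recomputed per frame, the personality context can vary over time, which is the general motivation for a frame-level (temporal) interaction rather than concatenating one fixed personality vector to every frame.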
Yuan Gao
Department of Intelligence Science and Technology, School of Informatics, Kyoto University, Kyoto, Japan
Hao Shi
Department of Intelligence Science and Technology, School of Informatics, Kyoto University, Kyoto, Japan
Yahui Fu
Kyoto University
Dialogue System, Natural Language Processing, Affective Computing
Chenhui Chu
Kyoto University
Machine Translation, Natural Language Processing, Vision and Language, Speech Processing
Tatsuya Kawahara
Professor, School of Informatics, Kyoto University
Speech Processing, Speech Recognition, Natural Language Processing, Dialogue