Gen-SER: When the generative model meets speech emotion recognition

📅 2026-01-28
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the difficulty that traditional speech emotion recognition (SER) methods have in modeling the intrinsic structure and distributional characteristics of emotion labels. The authors reformulate SER as a distribution shift problem solved with a generative model: discrete emotion labels are projected into a continuous space, the terminal (target) distribution is constructed with sinusoidal taxonomy encoding, and a target-matching-based generative model transforms the initial distribution into that terminal distribution. Classification is then performed by measuring the similarity between the generated terminal distribution and the ground-truth terminal distribution. According to the authors, the experiments confirm the method's efficacy, show that it extends to other speech-understanding tasks, and suggest it may apply to a broader range of classification tasks.

📝 Abstract
Speech emotion recognition (SER) is crucial in speech understanding and generation. Most approaches are based on either classification models or large language models. Different from previous methods, we propose Gen-SER, a novel approach that reformulates SER as a distribution shift problem via generative models. We propose to project discrete class labels into a continuous space, and obtain the terminal distribution via sinusoidal taxonomy encoding. The target-matching-based generative model is adopted to transform the initial distribution into the terminal distribution efficiently. The classification is achieved by calculating the similarity of the generated terminal distribution and ground truth terminal distribution. The experimental results confirm the efficacy of the proposed method, demonstrating its extensibility to various speech-understanding tasks and suggesting its potential applicability to a broader range of classification tasks.
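The final classification step described in the abstract can be illustrated with a minimal sketch. It assumes that "sinusoidal taxonomy encoding" resembles Transformer-style sinusoidal position encoding applied to an integer class index; the abstract does not give the exact form. The label set, the dimension `dim`, and the function names are hypothetical, and the target-matching generative model is not implemented here: a noisy copy of a class encoding stands in for its output.

```python
import numpy as np


def sinusoidal_encoding(index: int, dim: int = 16, base: float = 10000.0) -> np.ndarray:
    """Sinusoidal code for an integer class index (assumed form, not the paper's)."""
    half = dim // 2
    freqs = base ** (-np.arange(half) / half)  # geometrically spaced frequencies
    angles = index * freqs
    return np.concatenate([np.sin(angles), np.cos(angles)])


def classify(generated: np.ndarray, targets: np.ndarray) -> int:
    """Return the index of the class encoding most similar (cosine) to `generated`."""
    g = generated / np.linalg.norm(generated)
    t = targets / np.linalg.norm(targets, axis=1, keepdims=True)
    return int(np.argmax(t @ g))


# Hypothetical emotion label set; the abstract does not list the classes.
LABELS = ["angry", "happy", "neutral", "sad"]
TARGETS = np.stack([sinusoidal_encoding(i) for i in range(len(LABELS))])

# Stand-in for the generative model's output: the "neutral" encoding plus small noise.
rng = np.random.default_rng(0)
generated = TARGETS[2] + 0.05 * rng.standard_normal(TARGETS.shape[1])
predicted = LABELS[classify(generated, TARGETS)]
print(predicted)
```

Cosine similarity is one plausible choice of similarity measure; the abstract only states that similarity between the generated and ground-truth terminal distributions is computed.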
Problem

Research questions and friction points this paper is trying to address.

Speech Emotion Recognition
Generative Model
Distribution Shift
Classification
Continuous Representation
Innovation

Methods, ideas, or system contributions that make the work stand out.

generative model
speech emotion recognition
distribution shift
sinusoidal taxonomy encoding
target-matching
Taihui Wang
Institute of Acoustics, Chinese Academy of Sciences
statistical signal processing, blind source separation, speech dereverberation
Jinzheng Zhao
Tencent Multimodal Models Department, Beijing, China
Rilin Chen
Tencent Multimodal Models Department, Beijing, China
Tong Lei
Tencent AI Lab, Beijing, China
Wenwu Wang
Professor, University of Surrey, UK
signal processing, machine learning, machine listening, audio/speech/audio-visual, multimodal fusion
Dong Yu
Tencent AI Lab, Bellevue, WA, USA