🤖 AI Summary
This work addresses the limitation that traditional speech emotion recognition methods struggle to model the intrinsic structure and distributional characteristics of emotion labels. The authors reformulate the task as a distribution transfer problem, introducing generative modeling and distribution transfer principles for the first time: a generative model maps discrete emotion labels into a continuous semantic space, a target distribution is constructed via sinusoidal classification encoding, and a target-matching generation strategy drives the distribution transformation. Classification is then performed by measuring the similarity between the generated and ground-truth distributions. The proposed method performs strongly across multiple speech emotion recognition benchmarks, generalizes well, and shows promise for extension to other classification tasks.
📝 Abstract
Speech emotion recognition (SER) is crucial for speech understanding and generation. Most existing approaches are based on either classification models or large language models. In contrast, we propose Gen-SER, a novel approach that reformulates SER as a distribution transfer problem via generative models. We project discrete class labels into a continuous space and obtain the terminal distribution via sinusoidal taxonomy encoding. A target-matching-based generative model then transforms the initial distribution into the terminal distribution efficiently. Classification is achieved by computing the similarity between the generated terminal distribution and the ground-truth terminal distribution. Experimental results confirm the efficacy of the proposed method, demonstrating its extensibility to various speech-understanding tasks and suggesting its applicability to a broader range of classification tasks.
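To make the pipeline concrete, the following is a minimal sketch of two of its ingredients: encoding discrete class labels as continuous sinusoidal target vectors, and classifying by similarity between a generated vector and each class target. This is an illustrative approximation only, not the authors' implementation; the encoding dimension, the Transformer-style frequency schedule, and the use of cosine similarity are all assumptions.

```python
import numpy as np

def sinusoidal_class_encoding(num_classes: int, dim: int) -> np.ndarray:
    """Map each discrete class label to a continuous target vector
    using Transformer-style sinusoidal encodings (one row per class).
    The frequency schedule here is an assumption, not the paper's."""
    positions = np.arange(num_classes)[:, None]                     # (C, 1)
    freqs = np.exp(-np.log(10000.0) * np.arange(0, dim, 2) / dim)   # (dim/2,)
    enc = np.zeros((num_classes, dim))
    enc[:, 0::2] = np.sin(positions * freqs)
    enc[:, 1::2] = np.cos(positions * freqs)
    return enc

def classify_by_similarity(generated: np.ndarray, targets: np.ndarray) -> int:
    """Return the class whose target encoding is most similar
    (cosine similarity) to the generated vector."""
    g = generated / np.linalg.norm(generated)
    t = targets / np.linalg.norm(targets, axis=1, keepdims=True)
    return int(np.argmax(t @ g))

# Example: 4 emotion classes, 16-dimensional targets.
targets = sinusoidal_class_encoding(4, 16)
# Stand-in for a generative model's output: a vector near class 2's target.
sample = targets[2] + 0.05 * np.random.default_rng(0).normal(size=16)
print(classify_by_similarity(sample, targets))  # prints 2
```

In the actual method, the generated vector would come from the target-matching generative model conditioned on the speech input; here a noisy copy of a class target stands in for that output.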