EmoReg: Directional Latent Vector Modeling for Emotional Intensity Regularization in Diffusion-based Voice Conversion

📅 2024-12-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
In emotional voice conversion (EVC), fine-grained, continuous control of emotion intensity remains challenging due to low precision, degraded speech quality, and style distortion. This paper proposes a diffusion-based EVC framework with explicit emotion intensity control. Its core innovation is the first integration of unsupervised directional latent vector modeling (DVM) into diffusion-based EVC, enabling continuous and interpretable regularization of emotion intensity. Coupled with self-supervised emotional feature learning and a latent-space vector translation–reverse-diffusion fusion mechanism, the approach eliminates reliance on discrete emotion labels. Evaluated on English and Hindi datasets, our method outperforms all state-of-the-art approaches in both objective metrics (F0/energy alignment, SIM) and subjective assessments (MOS), achieving superior trade-offs between speech fidelity and emotion accuracy.

📝 Abstract
Emotional Voice Conversion (EVC) aims to convert the discrete emotional state of a given speech utterance from the source emotion to a target emotion while preserving linguistic content. In this paper, we propose regularizing emotion intensity in a diffusion-based EVC framework to generate precise speech in the target emotion. Traditional approaches control the intensity of an emotional state in the utterance via emotion class probabilities or intensity labels, which often leads to inept style manipulation and quality degradation. In contrast, we regulate emotion intensity using self-supervised learning-based feature representations and unsupervised directional latent vector modeling (DVM) in the emotional embedding space within a diffusion-based framework. These emotion embeddings can be modified according to the given target emotion intensity and the corresponding direction vector. The updated embeddings can then be fused into the reverse diffusion process to generate speech with the desired emotion and intensity. In summary, this paper achieves high-quality emotion intensity regularization in a diffusion-based EVC framework, the first work of its kind. The effectiveness of the proposed method is shown against state-of-the-art (SOTA) baselines via subjective and objective evaluations for the English and Hindi languages. (Demo samples are available at https://nirmesh-sony.github.io/EmoReg/)
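The abstract does not give the DVM equations, but the core idea it describes, shifting an utterance's emotion embedding along an unsupervised direction vector scaled by a target intensity, can be illustrated with a minimal sketch. All function names here are hypothetical, and the direction vector is assumed (one plausible unsupervised choice) to be the normalized difference between the mean embeddings of two emotion clusters:

```python
import numpy as np

def emotion_direction(source_embs: np.ndarray, target_embs: np.ndarray) -> np.ndarray:
    """Hypothetical unsupervised direction vector: the unit vector pointing
    from the mean of source-emotion embeddings to the mean of target-emotion
    embeddings. Inputs are (num_utterances, dim) arrays."""
    delta = target_embs.mean(axis=0) - source_embs.mean(axis=0)
    return delta / np.linalg.norm(delta)

def regulate_intensity(emb: np.ndarray, direction: np.ndarray,
                       intensity: float) -> np.ndarray:
    """Shift a single utterance embedding along the emotion direction.
    `intensity` is a continuous scalar (e.g. 0.0 = unchanged, larger values
    move further toward the target emotion). The result would then be fused
    into the reverse diffusion process as conditioning."""
    return emb + intensity * direction
```

Under this reading, continuous intensity control falls out of the scalar multiplier: the same direction vector serves every intensity level, so no discrete emotion or intensity labels are needed at inference time.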
Problem

Research questions and friction points this paper is trying to address.

Emotional Voice Conversion
Intensity Adjustment
Speech Quality Degradation
Innovation

Methods, ideas, or system contributions that make the work stand out.

EmoReg
Self-supervised Learning
Unsupervised Directional Vector Modeling
Ashish Gudmalwar
Media Analysis Group, Sony Research India, Bangalore
Ishan D. Biyani
Media Analysis Group, Sony Research India, Bangalore
Nirmesh Shah
Sony Research India
Voice Conversion, Voice Transformation, Speech Synthesis, Speech Recognition
Pankaj Wasnik
Sony Research
Computer Vision, Biometrics, Machine Translation, Speech Generation
R. Shah
Indraprastha Institute of Information Technology (IIIT), Delhi, India