🤖 AI Summary
This work addresses the degradation of emotional cues during quantization in existing neural speech codecs, which makes it hard to balance semantic fidelity, prosodic naturalness, and emotional expressiveness. To mitigate this issue, the authors propose an emotion-guided end-to-end neural codec framework that explicitly preserves emotionally salient features in compressed representations through three mechanisms: emotion–semantic guided latent modulation, relation-preserving emotion–semantic knowledge distillation, and emotion-weighted semantic alignment. Evaluations on speech reconstruction, emotion recognition, and downstream text-to-speech synthesis show improved emotional consistency and perceptual quality while maintaining content accuracy.
📝 Abstract
Neural speech codecs provide discrete representations for speech language models, but emotional cues are often degraded during quantization. Existing codecs mainly optimize acoustic reconstruction, leaving emotional expressiveness insufficiently modeled at the representation level. We propose an emotion-guided neural speech codec that explicitly preserves emotional information while maintaining semantic fidelity and prosodic naturalness. Our framework combines emotion-semantic guided latent modulation, relation-preserving emotion-semantic distillation, and emotion-weighted semantic alignment to retain emotionally salient cues under compression. Extensive evaluations across speech reconstruction, emotion recognition, and downstream text-to-speech generation demonstrate improved emotional consistency and perceptual quality without sacrificing content accuracy.
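The abstract does not specify how the relation-preserving distillation is formulated. As a rough illustration of the general idea only, the sketch below assumes one common form of relational distillation: the student (codec latents) is trained to reproduce the pairwise cosine-similarity structure of a teacher (emotion or semantic embeddings) within a batch, rather than the teacher vectors themselves. The function names and the choice of cosine similarity with a mean-squared penalty are assumptions for illustration, not the authors' method.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def relation_matrix(embeddings):
    """Pairwise cosine similarities within a batch of embeddings."""
    return [[cosine(u, v) for v in embeddings] for u in embeddings]


def relation_distillation_loss(student, teacher):
    """Mean squared difference between student and teacher relation matrices.

    Hypothetical loss: only within-batch similarity structure is compared,
    so student and teacher embeddings may have different dimensionality.
    """
    rs = relation_matrix(student)
    rt = relation_matrix(teacher)
    n = len(student)
    total = sum((rs[i][j] - rt[i][j]) ** 2 for i in range(n) for j in range(n))
    return total / (n * n)
```

Under this assumed formulation, the loss is zero whenever the student's latents relate to each other the same way the teacher's embeddings do, regardless of scale or dimensionality, which is one way a compressed representation could keep emotional relationships between utterances intact.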