🤖 AI Summary
This work addresses the degradation of emotional expressiveness in streaming speech anonymization, a common issue arising from the loss of speaker emotion information in neural audio codec language models. To preserve emotion without increasing inference latency, the authors propose a novel approach that leverages neutral-to-emotional speech pairs from the same speaker for supervised fine-tuning and introduces a frame-level emotion distillation mechanism applied to acoustic token hidden states. This is the first study to integrate frame-level emotion distillation into streaming speech anonymization. Evaluated under the VoicePrivacy 2024 protocol, the method achieves a 49.2% unweighted average recall (UAR) for emotion recognition, a 24% relative improvement over the baseline, while maintaining a word error rate (WER) of 5.77% and strong privacy protection with a 49.0% equal error rate (EER).
📝 Abstract
We address the challenge of preserving emotional content in streaming speaker anonymization (SA). Neural audio codec language models trained for audio continuation tend to degrade source emotion: content tokens discard emotional information, and the model defaults to dominant acoustic patterns rather than preserving paralinguistic attributes. We propose supervised fine-tuning with neutral-emotional utterance pairs from the same speaker, combined with frame-level emotion distillation on acoustic token hidden states. All modifications are confined to fine-tuning, which takes less than 2 hours on 4 GPUs and adds zero inference latency overhead, while maintaining a competitive 180 ms streaming latency. On the VoicePrivacy 2024 protocol, our approach achieves 49.2% UAR (emotion preservation) with 5.77% WER (intelligibility), a +24% relative UAR improvement over the baseline (39.7% -> 49.2%) and +10% over the emotion-prompt variant (44.6% UAR), while maintaining strong privacy (49.0% EER). Demo and code are available at: https://anonymous3842031239.github.io/
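To make the distillation idea concrete, here is a minimal sketch of a frame-level distillation loss of the kind the abstract describes: the hidden states of the acoustic tokens are projected into a teacher emotion-feature space and regressed against per-frame teacher embeddings. The function name, shapes, and the choice of mean-squared error are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def frame_level_distillation_loss(student_hidden, teacher_emotion, proj):
    """Hypothetical sketch of frame-level emotion distillation.

    student_hidden : (T, d_s) acoustic-token hidden states per frame
    teacher_emotion: (T, d_e) frame-level teacher emotion embeddings
    proj           : (d_s, d_e) learned projection into the teacher space

    Returns the mean-squared error between projected student states and
    the teacher embeddings, averaged over frames and dimensions.
    """
    pred = student_hidden @ proj            # (T, d_e) projected states
    return float(np.mean((pred - teacher_emotion) ** 2))

# Toy example: 3 frames, student dim 4, teacher dim 2.
rng = np.random.default_rng(0)
s = rng.standard_normal((3, 4))   # student hidden states
t = rng.standard_normal((3, 2))   # teacher emotion features
W = rng.standard_normal((4, 2))   # projection matrix
loss = frame_level_distillation_loss(s, t, W)
```

In training, such a term would be added to the model's usual token-prediction loss during fine-tuning, so the emotion constraint costs nothing extra at inference time, consistent with the abstract's zero-latency-overhead claim.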