StreamVoiceAnon+: Emotion-Preserving Streaming Speaker Anonymization via Frame-Level Acoustic Distillation

📅 2026-03-06
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the degradation of emotional expressiveness in streaming speech anonymization, a common issue caused by the loss of speaker emotion information in neural audio codec language models. To preserve emotion without increasing inference latency, the authors propose leveraging neutral-to-emotional speech pairs from the same speaker for supervised fine-tuning, and introduce a frame-level emotion distillation mechanism applied to acoustic token hidden states. This is the first study to integrate frame-level emotion distillation into streaming speech anonymization. Evaluated under the VoicePrivacy 2024 protocol, the method achieves 49.2% unweighted average recall (UAR) for emotion recognition, a 24% relative improvement, while maintaining a word error rate (WER) of 5.77% and strong privacy protection with a 49.0% equal error rate (EER).

📝 Abstract
We address the challenge of preserving emotional content in streaming speaker anonymization (SA). Neural audio codec language models trained for audio continuation tend to degrade source emotion: content tokens discard emotional information, and the model defaults to dominant acoustic patterns rather than preserving paralinguistic attributes. We propose supervised finetuning with neutral-emotion utterance pairs from the same speaker, combined with frame-level emotion distillation on acoustic token hidden states. All modifications are confined to finetuning, which takes less than 2 hours on 4 GPUs and adds zero inference latency overhead, while maintaining a competitive 180 ms streaming latency. On the VoicePrivacy 2024 protocol, our approach achieves a 49.2% UAR (emotion preservation) with 5.77% WER (intelligibility), a +24% relative UAR improvement over the baseline (39.7% → 49.2%) and +10% over the emotion-prompt variant (44.6% UAR), while maintaining strong privacy (EER 49.0%). Demo and code are available: https://anonymous3842031239.github.io/
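The abstract describes a frame-level emotion distillation loss applied to acoustic token hidden states. The paper's exact loss formulation is not given here, so the sketch below is a hypothetical illustration: it aligns each frame of a student model's hidden states with a teacher's frame-level emotion embeddings via a learned projection and a per-frame cosine distance. The function and tensor names, the projection layer, and the choice of cosine distance (rather than, say, MSE) are all assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def frame_level_emotion_distill_loss(student_hidden, teacher_emotion, proj):
    """Hypothetical frame-level distillation loss.

    student_hidden:  (B, T, D_s) acoustic-token hidden states from the student.
    teacher_emotion: (B, T, D_t) frame-level embeddings from an emotion teacher.
    proj:            learned map from the student space to the teacher space.
    """
    # Project student hidden states into the teacher's embedding space.
    projected = proj(student_hidden)                               # (B, T, D_t)
    # Per-frame cosine similarity, then average the distance over time/batch.
    cos = F.cosine_similarity(projected, teacher_emotion, dim=-1)  # (B, T)
    return (1.0 - cos).mean()

# Toy shapes: batch of 2, 50 frames, student dim 256, teacher dim 128.
proj = torch.nn.Linear(256, 128)
student = torch.randn(2, 50, 256)
teacher = torch.randn(2, 50, 128)
loss = frame_level_emotion_distill_loss(student, teacher, proj)
```

In a real setup this term would be added to the model's finetuning objective so that emotion cues are preserved frame by frame without any extra inference-time cost, consistent with the paper's claim of zero added latency.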
Problem

Research questions and friction points this paper is trying to address.

speaker anonymization
emotion preservation
streaming speech
paralinguistic attributes
neural audio codec
Innovation

Methods, ideas, or system contributions that make the work stand out.

emotion preservation
speaker anonymization
frame-level distillation
streaming audio
acoustic token
Nikita Kuzmin
Nanyang Technological University, Singapore; Institute for Infocomm Research (I2R), A*STAR, Singapore
Kong Aik Lee
The Hong Kong Polytechnic University, Hong Kong
Eng Siong Chng
Nanyang Technological University, Singapore