🤖 AI Summary
This work addresses the degradation of emotional expressiveness in streaming speech anonymization, a common issue arising from the loss of speaker emotion information in neural audio codec language models. To preserve emotion without increasing inference latency, the authors propose a novel approach that leverages neutral-to-emotional speech pairs from the same speaker for supervised fine-tuning and introduces a frame-level emotion distillation mechanism applied to acoustic token hidden states. This is the first study to integrate frame-level emotion distillation into streaming speech anonymization. Evaluated under the VoicePrivacy 2024 protocol, the method achieves a 49.2% unweighted average recall (UAR) for emotion recognition, a 24% relative improvement over the baseline, while maintaining a word error rate (WER) of 5.77% and strong privacy protection with a 49.0% equal error rate (EER).
📝 Abstract
We address the challenge of preserving emotional content in streaming speaker anonymization (SA). Neural audio codec language models trained for audio continuation tend to degrade source emotion: content tokens discard emotional information, and the model defaults to dominant acoustic patterns rather than preserving paralinguistic attributes. We propose supervised fine-tuning with neutral-emotional utterance pairs from the same speaker, combined with frame-level emotion distillation on acoustic token hidden states. All modifications are confined to fine-tuning, which takes less than 2 hours on 4 GPUs and adds zero inference latency overhead, while maintaining a competitive 180 ms streaming latency. On the VoicePrivacy 2024 protocol, our approach achieves 49.2% UAR (emotion preservation) with 5.77% WER (intelligibility), a +24% relative UAR improvement over the baseline (39.7% -> 49.2%) and +10% over the emotion-prompt variant (44.6% UAR), while maintaining strong privacy (49.0% EER). Demo and code are available at: https://anonymous3842031239.github.io/
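To make the distillation idea concrete, here is a minimal sketch of a frame-level distillation loss of the kind the abstract describes: the hidden states of the acoustic tokens are projected into a teacher emotion-feature space and regressed against per-frame teacher embeddings. The function name, shapes, and the choice of mean-squared error are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def frame_level_distillation_loss(student_hidden, teacher_emotion, proj):
    """Hypothetical sketch of frame-level emotion distillation.

    student_hidden : (T, d_s) acoustic-token hidden states per frame
    teacher_emotion: (T, d_e) frame-level teacher emotion embeddings
    proj           : (d_s, d_e) learned projection into the teacher space

    Returns the mean-squared error between projected student states and
    the teacher embeddings, averaged over frames and dimensions.
    """
    pred = student_hidden @ proj            # (T, d_e) projected states
    return float(np.mean((pred - teacher_emotion) ** 2))

# Toy example: 3 frames, student dim 4, teacher dim 2.
rng = np.random.default_rng(0)
s = rng.standard_normal((3, 4))   # student hidden states
t = rng.standard_normal((3, 2))   # teacher emotion features
W = rng.standard_normal((4, 2))   # projection matrix
loss = frame_level_distillation_loss(s, t, W)
```

In training, such a term would be added to the model's usual token-prediction loss during fine-tuning, so the emotion constraint costs nothing extra at inference time, consistent with the abstract's zero-latency-overhead claim.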