Stream-Voice-Anon: Enhancing Utility of Real-Time Speaker Anonymization via Neural Audio Codec and Language Models

📅 2026-01-20
🤖 AI Summary
This work proposes a streaming speaker anonymization framework for real-time speech applications that integrates a neural audio codec (NAC) with a causal language model (LM), addressing the dual challenges of low latency and privacy preservation. It combines pseudo-speaker representation sampling, speaker embedding mixing, and diverse prompt selection for LM conditioning, and compares dynamic and fixed delay configurations to explore the latency-privacy trade-off. The disentanglement properties of quantized content codes are used to limit speaker information leakage. Evaluated under the VoicePrivacy 2024 Challenge protocol against DarkStream, the method achieves up to 46% relative word error rate (WER) reduction and up to 28% relative improvement in unweighted average recall (UAR) for emotion recognition, at comparable latency (180 ms vs. 200 ms). Privacy protection is comparable against lazy-informed attackers but degrades by 15% relative against semi-informed attackers.
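
To make "pseudo-speaker representation sampling" and "speaker embedding mixing" concrete, here is a minimal sketch of one common way such a step can be done: draw a few embeddings from a pool and blend them with random convex weights. This is an illustration under stated assumptions, not the paper's implementation. The function names, the pool, the embedding dimension, and the weighting scheme are all hypothetical.

```python
import math
import random


def l2_normalize(vec):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def mix_pseudo_speaker(pool, num_to_mix=3, seed=None):
    """Sample embeddings from a pool and blend them into one pseudo-speaker.

    Assumption: mixing is a convex combination followed by L2 normalization.
    The source does not specify the actual mixing rule.
    """
    if not 1 <= num_to_mix <= len(pool):
        raise ValueError("num_to_mix must be between 1 and len(pool)")
    rng = random.Random(seed)
    chosen = rng.sample(pool, num_to_mix)

    # Random positive weights, normalized so they sum to 1.
    raw = [rng.random() + 1e-6 for _ in chosen]
    total = sum(raw)
    weights = [w / total for w in raw]

    dim = len(chosen[0])
    mixed = [
        sum(w * emb[i] for w, emb in zip(weights, chosen))
        for i in range(dim)
    ]
    return l2_normalize(mixed), weights


# Toy pool of 10 random unit vectors standing in for real speaker embeddings.
_pool_rng = random.Random(0)
pool = [
    l2_normalize([_pool_rng.gauss(0.0, 1.0) for _ in range(8)])
    for _ in range(10)
]
pseudo, weights = mix_pseudo_speaker(pool, num_to_mix=3, seed=42)
```

In a real system the pool would hold embeddings from a speaker encoder, and the resulting vector would condition the LM in place of the source speaker's embedding.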

📝 Abstract
Protecting speaker identity is crucial for online voice applications, yet streaming speaker anonymization (SA) remains underexplored. Recent research has demonstrated that neural audio codecs (NACs) provide superior speaker feature disentanglement and linguistic fidelity. NACs can also be combined with causal language models (LMs) to enhance linguistic fidelity and prompt control for streaming tasks. However, existing NAC-based online LM systems are designed for voice conversion (VC) rather than anonymization and lack the techniques required for privacy protection. Building on these advances, we present Stream-Voice-Anon, which adapts modern causal LM-based NAC architectures specifically for streaming SA by integrating anonymization techniques. Our anonymization approach incorporates pseudo-speaker representation sampling, speaker embedding mixing, and diverse prompt selection strategies for LM conditioning that leverage the disentanglement properties of quantized content codes to prevent speaker information leakage. Additionally, we compare dynamic and fixed delay configurations to explore latency-privacy trade-offs in real-time scenarios. Under the VoicePrivacy 2024 Challenge protocol, Stream-Voice-Anon achieves substantial improvements in intelligibility (up to 46% relative WER reduction) and emotion preservation (up to 28% relative UAR improvement) compared to the previous state-of-the-art streaming method DarkStream, while maintaining comparable latency (180 ms vs. 200 ms) and comparable privacy protection against lazy-informed attackers, though it shows a 15% relative degradation against semi-informed attackers.
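
The headline numbers in the abstract are relative changes, not absolute percentage points. The short sketch below shows how such figures are conventionally computed. Only the latency pair (180 ms vs. 200 ms) comes from the abstract. The WER and UAR values are hypothetical placeholders chosen to reproduce a 46% reduction and a 28% improvement, and the function names are illustrative.

```python
def relative_reduction(baseline, system):
    """Fraction by which `system` is lower than `baseline` (lower is better)."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (baseline - system) / baseline


def relative_improvement(baseline, system):
    """Fraction by which `system` is higher than `baseline` (higher is better)."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (system - baseline) / baseline


# Hypothetical WER values (%): placeholders, not results from the paper.
wer_rel = relative_reduction(baseline=10.0, system=5.4)

# Hypothetical UAR values (%): placeholders, not results from the paper.
uar_rel = relative_improvement(baseline=50.0, system=64.0)

# Latency figures from the abstract: 200 ms (DarkStream) vs. 180 ms.
latency_rel = relative_reduction(baseline=200.0, system=180.0)

print(f"relative WER reduction:     {wer_rel:.0%}")
print(f"relative UAR improvement:   {uar_rel:.0%}")
print(f"relative latency reduction: {latency_rel:.0%}")
```

Under this convention, a 46% relative WER reduction means the anonymized speech produces roughly half as many recognition errors as the baseline, whatever the absolute WER happens to be.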
Problem

Research questions and friction points this paper is trying to address.

speaker anonymization
streaming
privacy protection
real-time voice applications
speaker identity
Innovation

Methods, ideas, or system contributions that make the work stand out.

streaming speaker anonymization
neural audio codec
causal language model
speaker disentanglement
real-time privacy
Authors

Nikita Kuzmin
Nanyang Technological University, Singapore; Institute for Infocomm Research, A⋆STAR, Singapore
Songting Liu
Nanyang Technological University, Singapore
Kong Aik Lee
The Hong Kong Polytechnic University, Hong Kong
Speaker and Spoken Language Recognition, Speech Processing, Digital Signal Processing, Subband
E. Chng
Nanyang Technological University, Singapore