Privacy-Preserving End-to-End Full-Duplex Speech Dialogue Models

📅 2026-03-09

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study addresses an underexplored privacy risk in full-duplex end-to-end spoken dialogue systems: speaker identity can leak through the model's internal representations during continuous interaction. Following the VoicePrivacy 2024 evaluation protocol, the work systematically characterizes cross-layer and cross-turn patterns of speaker identity leakage in the state-of-the-art models SALM-Duplex and Moshi, which the summary describes as the first such analysis. To mitigate the risk, it proposes two low-latency streaming anonymization methods: waveform-to-waveform front-end anonymization (Anon-W2W) and waveform-to-feature domain replacement (Anon-W2F). In the reported experiments, Anon-W2F raises the speaker verification equal error rate (EER) from 11.2% to 41.0%, approaching the 50% chance level, while Anon-W2W keeps response latency below 0.8 seconds and retains 78-93% of the baseline sBERT semantic similarity score.

📝 Abstract

End-to-end full-duplex speech models feed user audio through an always-on LLM backbone, yet the speaker privacy implications of their hidden representations remain unexamined. Following the VoicePrivacy 2024 protocol with a lazy-informed attacker, we show that the hidden states of SALM-Duplex and Moshi leak substantial speaker identity information. Layer-wise and turn-wise analyses reveal that the leakage persists across all transformer layers, with SALM-Duplex leaking more strongly in early layers while Moshi leaks uniformly, and that linkability rises sharply within the first few turns. We propose two streaming anonymization setups using Stream-Voice-Anon: a waveform-level front-end (Anon-W2W) and a feature-domain replacement (Anon-W2F). Anon-W2F raises the EER by a factor of more than 3.5 relative to the discrete encoder baseline (11.2% to 41.0%), approaching the 50% random-chance ceiling, while Anon-W2W retains 78-93% of the baseline sBERT score across setups with sub-second response latency (FRL under 0.8 s).
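To make the EER figures concrete, the following is a minimal sketch of how an equal error rate can be approximated from speaker verification scores. It is not the VoicePrivacy 2024 scoring code or the authors' implementation; the function names `equal_error_rate` and `relative_increase` are hypothetical, and the simple threshold sweep is an assumption chosen for readability.

```python
def equal_error_rate(target_scores, nontarget_scores):
    """Approximate the EER by sweeping a threshold over all observed scores.

    target_scores: similarity scores for same-speaker trials.
    nontarget_scores: similarity scores for different-speaker trials.
    Returns the mean of the false acceptance and false rejection rates at
    the threshold where the two are closest.
    """
    if not target_scores or not nontarget_scores:
        raise ValueError("both score lists must be non-empty")

    thresholds = sorted(set(target_scores) | set(nontarget_scores))
    # Add one threshold above every score so "reject everything" is covered.
    thresholds.append(thresholds[-1] + 1.0)

    best_gap, eer = None, None
    for t in thresholds:
        # False acceptance: a different-speaker trial scores at or above t.
        far = sum(s >= t for s in nontarget_scores) / len(nontarget_scores)
        # False rejection: a same-speaker trial scores below t.
        frr = sum(s < t for s in target_scores) / len(target_scores)
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return eer


def relative_increase(before, after):
    """Ratio between two EER values, e.g. baseline vs. anonymized."""
    return after / before


# Toy scores (illustrative only, not data from the paper).
separable = equal_error_rate([0.9, 0.8, 0.7], [0.1, 0.2, 0.3])
chance = equal_error_rate([0.1, 0.2, 0.3, 0.4], [0.1, 0.2, 0.3, 0.4])

# Reported EERs from the abstract: 11.2% baseline, 41.0% with Anon-W2F.
factor = relative_increase(11.2, 41.0)
```

With perfectly separable scores the attacker's EER is 0, while identical score distributions give 0.5, which is why an EER near 50% indicates that the attacker does little better than guessing. The reported change from 11.2% to 41.0% corresponds to a factor of about 3.66, consistent with the "more than 3.5" stated above.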

Problem

Research questions and friction points this paper is trying to address.

speaker privacy

full-duplex speech

voice anonymization

hidden representations

end-to-end speech models

Innovation

Methods, ideas, or system contributions that make the work stand out.

privacy-preserving speech processing

full-duplex dialogue models

speaker anonymization