VoxPrivacy: A Benchmark for Evaluating Interactional Privacy of Speech Language Models

📅 2026-01-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses a critical limitation of existing speech language models (SLMs) in multi-user scenarios: they lack speaker identity awareness and struggle to protect interaction-sensitive private information—such as personal schedules—based on contextual cues. We introduce the concept of “interactional privacy” and present the first benchmark specifically designed to evaluate SLMs’ privacy-preserving behaviors through a three-tiered task hierarchy, progressing from passive compliance to active reasoning. The benchmark comprises a 32-hour bilingual, multi-speaker evaluation set of synthetic utterances, a human-recorded real-speaker verification subset (Real-VoxPrivacy), and a separate 4,000-hour training set. We further formulate a conditional privacy decision task and propose a tailored fine-tuning strategy. Evaluations across nine prominent models reveal that most open-source SLMs perform near chance level (~50% accuracy) on conditional privacy tasks, while even closed-source systems fail at active reasoning. Fine-tuning substantially enhances privacy capabilities without compromising conversational robustness.

📝 Abstract
As Speech Language Models (SLMs) transition from personal devices to shared, multi-user environments such as smart homes, a new challenge emerges: the model is expected to distinguish between users to manage information flow appropriately. Without this capability, an SLM could reveal one user's confidential schedule to another, a privacy failure we term interactional privacy. Thus, the ability to generate speaker-aware responses becomes essential for the safe deployment of SLMs. Current SLM benchmarks test dialogue ability but overlook speaker identity. Multi-speaker benchmarks check who said what without assessing whether SLMs adapt their responses. Privacy benchmarks focus on globally sensitive data (e.g., bank passwords) while neglecting contextually privacy-sensitive information (e.g., a user's private appointment). To address this gap, we introduce VoxPrivacy, the first benchmark designed to evaluate interactional privacy in SLMs. VoxPrivacy spans three tiers of increasing difficulty, from following direct secrecy commands to proactively protecting privacy. Our evaluation of nine SLMs on a 32-hour bilingual dataset reveals a widespread vulnerability: most open-source models perform close to random chance (around 50% accuracy) on conditional privacy decisions, while even strong closed-source systems fall short on proactive privacy inference. We further validate these findings on Real-VoxPrivacy, a human-recorded subset, confirming that failures observed on synthetic data persist in real speech. Finally, we demonstrate a viable path forward: by fine-tuning on a new 4,000-hour training set, we improve privacy-preserving abilities while maintaining robustness. To support future work, we release the VoxPrivacy benchmark, the large-scale training set, and the fine-tuned model to foster the development of safer and more context-aware SLMs.
Problem

Research questions and friction points this paper is trying to address.

interactional privacy
speech language models
multi-user environments
speaker-awareness
contextual privacy
Innovation

Methods, ideas, or system contributions that make the work stand out.

interactional privacy
speaker-aware response
privacy benchmark
speech language models
contextual privacy