🤖 AI Summary
Large audio-language models (LALMs) unintentionally capture bystander speech in multi-speaker scenarios, posing serious privacy risks that existing benchmarks and defenses have not addressed systematically. We formally define the "bystander privacy protection" problem and propose a selective auditory mechanism. To evaluate it rigorously, we introduce SH-Bench, the first dedicated benchmark, together with a unified metric, Selective Efficacy, which jointly quantifies a model's comprehension of the primary speaker and its suppression of bystander information. We further design Bystander Privacy Fine-Tuning (BPFT), which leverages both real and synthetic multi-speaker data alongside a 77K-question multiple-choice test set to teach models to reject bystander-related queries. Experiments on SH-Bench reveal significant bystander privacy leakage across mainstream LALMs; BPFT improves Selective Efficacy by up to 15.9%, establishing a learnable, evaluable paradigm for privacy-aware audio foundation models.
📝 Abstract
Large audio-language models (LALMs) are increasingly deployed in real-world settings where they inevitably capture speech from unintended nearby bystanders, raising privacy risks that existing benchmarks and defenses largely overlook. We introduce SH-Bench, the first benchmark designed to evaluate selective hearing: a model's ability to attend to an intended main speaker while refusing to process or reveal information about incidental bystander speech. SH-Bench contains 3,968 multi-speaker audio mixtures spanning both real-world and synthetic scenarios, paired with 77K multiple-choice questions that probe models under general and selective operating modes. We propose Selective Efficacy (SE), a unified metric capturing both multi-speaker comprehension and bystander-privacy protection. Our evaluation of state-of-the-art open-source and proprietary LALMs reveals substantial privacy leakage: strong audio understanding fails to translate into selective protection of bystander privacy. To close this gap, we introduce Bystander Privacy Fine-Tuning (BPFT), a training pipeline that teaches models to refuse bystander-related queries without degrading main-speaker comprehension. BPFT yields substantial gains, improving SE by up to 15.9% over Gemini 2.5 Pro, demonstrating that selective hearing is learnable but far from achieved in current LALMs. SH-Bench and BPFT provide the first systematic framework for measuring and improving bystander privacy in audio foundation models.
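The abstract describes Selective Efficacy as a single score combining two competing quantities: accuracy on main-speaker questions and correct refusal of bystander-related questions. The paper's exact formula is not given here; as a minimal illustrative sketch, one natural way to combine them is a harmonic mean, which penalizes models that excel at comprehension while leaking bystander information (the function name and formulation below are assumptions, not the authors' definition):

```python
def selective_efficacy(main_acc: float, bystander_refusal: float) -> float:
    """Hypothetical Selective-Efficacy-style score.

    main_acc          -- accuracy on main-speaker multiple-choice questions, in [0, 1]
    bystander_refusal -- rate of correctly refused bystander-related questions, in [0, 1]

    A harmonic mean requires a model to do well on BOTH axes to score
    high; this is an illustrative assumption, not the paper's metric.
    """
    if main_acc + bystander_refusal == 0:
        return 0.0
    return 2 * main_acc * bystander_refusal / (main_acc + bystander_refusal)


# A model with strong comprehension but heavy bystander leakage
# scores far below a balanced model under this formulation:
leaky = selective_efficacy(0.90, 0.10)
balanced = selective_efficacy(0.85, 0.80)
```

Under such a score, gains like the reported 15.9% SE improvement can come from better refusal behavior alone, without any change in main-speaker comprehension.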