VoxSafeBench: Not Just What Is Said, but Who, How, and Where

📅 2026-04-15

🤖 AI Summary
Current speech language models (SLMs) often fail to account for speaker identity, paralinguistic cues, and situational context when assessing the safety, fairness, and privacy risks of a spoken request, so social alignment can fail even when the transcript appears benign. To address this, the work proposes VoxSafeBench, among the first benchmarks to jointly evaluate the social alignment of SLMs across safety, fairness, and privacy, using a two-tier design. Tier 1 assesses content-centric risks with matched text and audio inputs, while Tier 2 targets audio-conditioned risks in which the transcript is benign and the appropriate response depends on the speaker, paralinguistic cues, or the surrounding environment. Evaluation across 22 tasks with bilingual coverage, together with intermediate perception probes, reveals a "speech grounding gap": models can detect the relevant acoustic cues but fail to act on them, with safety awareness, fairness, and privacy protections all degrading when the decisive cue arrives in speech rather than text.

📝 Abstract
As speech language models (SLMs) transition from personal devices into shared, multi-user environments, their responses must account for far more than the words alone. Who is speaking, how they sound, and where the conversation takes place can each turn an otherwise benign request into one that is unsafe, unfair, or privacy-violating. Existing benchmarks, however, largely focus on basic audio comprehension, study individual risks in isolation, or conflate content that is inherently harmful with content that only becomes problematic due to its acoustic context. We introduce VoxSafeBench, among the first benchmarks to jointly evaluate social alignment in SLMs across three dimensions: safety, fairness, and privacy. VoxSafeBench adopts a Two-Tier design: Tier1 evaluates content-centric risks using matched text and audio inputs, while Tier2 targets audio-conditioned risks in which the transcript is benign but the appropriate response hinges on the speaker, paralinguistic cues, or the surrounding environment. To validate Tier2, we include intermediate perception probes and confirm that frontier SLMs can successfully detect these acoustic cues yet still fail to act on them appropriately. Across 22 tasks with bilingual coverage, we find that safeguards appearing robust on text often degrade in speech: safety awareness drops for speaker- and scene-conditioned risks, fairness erodes when demographic differences are conveyed vocally, and privacy protections falter when contextual cues arrive acoustically. Together, these results expose a pervasive speech grounding gap: current SLMs frequently recognize the relevant social norm in text but fail to apply it when the decisive cue must be grounded in speech. Code and data are publicly available at: https://amphionteam.github.io/VoxSafeBench_demopage/
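
One way to picture the gap described above is as a conditional failure rate: among Tier 2 items where the model passes the perception probe, how often is the final response still inappropriate? The sketch below is a minimal illustration under that assumption. The `Tier2Result` schema, its field names, the `grounding_gap` function, and the toy records are all hypothetical; they are not the benchmark's actual data format, metric, or results.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Tier2Result:
    """One Tier 2 item (hypothetical schema): benign transcript, risk carried by an acoustic cue."""
    cue_type: str               # e.g. "speaker", "paralinguistic", "environment"
    cue_perceived: bool         # did the model pass the perception probe for this cue?
    response_appropriate: bool  # did the final response account for the cue?


def grounding_gap(results: List[Tier2Result]) -> float:
    """Among items whose cue was perceived, return the fraction with an inappropriate response."""
    perceived = [r for r in results if r.cue_perceived]
    if not perceived:
        return 0.0  # no perceived cues, so the conditional rate is reported as zero
    failed = sum(1 for r in perceived if not r.response_appropriate)
    return failed / len(perceived)


# Toy records for illustration only; these are not data from the paper.
toy = [
    Tier2Result("speaker", True, False),
    Tier2Result("speaker", True, True),
    Tier2Result("environment", True, False),
    Tier2Result("paralinguistic", False, False),
]
gap = grounding_gap(toy)
```

Conditioning on `cue_perceived` separates failures of perception from failures of grounding, which is the distinction the perception probes are meant to establish.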
Problem

Research questions and friction points this paper is trying to address.

speech language models
social alignment
acoustic context
safety
fairness
Innovation

Methods, ideas, or system contributions that make the work stand out.

speech language models
social alignment
audio-conditioned risks
VoxSafeBench
speech grounding gap