🤖 AI Summary
Problem: End-to-end audio language models (Audio LMs) pose significant security and privacy risks during deployment, particularly through unintended exposure of sensitive acoustic attributes, such as speaker identity, that may enable misuse or violate regulatory requirements.
Method: This work pioneers the application of the Principle of Least Privilege (PoLP) to Audio LM deployment design. We propose a dual-dimensional “Necessity–Permission Boundary” evaluation framework, integrating security engineering and AI governance perspectives. Through architectural comparison, sensitive attribute identification, and systematic benchmark defect analysis, we assess prevailing evaluation protocols.
Contribution/Results: We reveal fundamental gaps in mainstream audio benchmarks regarding privacy preservation and privilege control—specifically their failure to enforce minimal acoustic feature exposure. The study delivers an actionable, cross-disciplinary assessment methodology and identifies critical research directions for aligning technical design with policy compliance in Audio LM development and deployment.
📝 Abstract
We are at a turning point for language models that accept audio input. The latest end-to-end audio language models (Audio LMs) process speech directly instead of relying on a separate transcription step. This shift preserves detailed information, such as intonation or the presence of multiple speakers, that would otherwise be lost in transcription. However, it also introduces new safety risks, including the potential misuse of speaker identity cues and other sensitive vocal attributes, which could have legal implications. In this position paper, we urge a closer examination of how these models are built and deployed. We argue that the principle of least privilege should guide decisions on whether to deploy cascaded or end-to-end models. Specifically, evaluations should assess (1) whether end-to-end modeling is necessary for a given application; and (2) the appropriate scope of information access. Finally, we highlight related gaps in current Audio LM benchmarks and identify key open research questions, both technical and policy-related, that must be addressed to enable the responsible deployment of end-to-end Audio LMs.