π€ AI Summary
This study presents the first systematic investigation into whether audio large language models (ALLMs) inadvertently leak user privacy through voiceprints. To this end, we introduce HearSay, a benchmark comprising over 22,000 real-world speech samples, and employ a comprehensive evaluation framework integrating automated data profiling, human validation, chain-of-thought (CoT) reasoning analysis, and privacy attribute classification to assess ALLMsβ capacity for privacy extraction and the efficacy of existing safety mechanisms. Our findings reveal that ALLMs achieve 92.89% accuracy in gender identification and effectively model diverse sociodemographic attributes. However, they exhibit virtually no ability to refuse privacy-sensitive queries, and CoT reasoning substantially amplifies leakage risks, exposing critical deficiencies in current privacy safeguards.
π Abstract
While Audio Large Language Models (ALLMs) have achieved remarkable progress in understanding and generation, their potential privacy implications remain largely unexplored. This paper takes the first step to investigate whether ALLMs inadvertently leak user privacy solely through acoustic voiceprints and introduces $\textit{HearSay}$, a comprehensive benchmark constructed from over 22,000 real-world audio clips. To ensure data quality, the benchmark is meticulously curated through a rigorous pipeline involving automated profiling and human verification, guaranteeing that all privacy labels are grounded in factual records. Extensive experiments on $\textit{HearSay}$ yield three critical findings: $\textbf{Significant Privacy Leakage}$: ALLMs inherently extract private attributes from voiceprints, reaching 92.89% accuracy on gender and effectively profiling social attributes. $\textbf{Insufficient Safety Mechanisms}$: Alarmingly, existing safeguards are severely inadequate; most models fail to refuse privacy-intruding requests, exhibiting near-zero refusal rates for physiological traits. $\textbf{Reasoning Amplifies Risk}$: Chain-of-Thought (CoT) reasoning exacerbates privacy risks in capable models by uncovering deeper acoustic correlations. These findings expose critical vulnerabilities in ALLMs, underscoring the urgent need for targeted privacy alignment. The codes and dataset are available at https://github.com/JinWang79/HearSay_Benchmark