HearSay Benchmark: Do Audio LLMs Leak What They Hear?

📅 2026-01-07

🏛️ arXiv.org

📈 Citations: 0

✨ Influential: 0

career value

211K/year

🤖 AI Summary

This study presents the first systematic investigation into whether audio large language models (ALLMs) inadvertently leak user privacy through voiceprints. To this end, we introduce HearSay, a benchmark comprising over 22,000 real-world speech samples, and employ a comprehensive evaluation framework integrating automated data profiling, human validation, chain-of-thought (CoT) reasoning analysis, and privacy attribute classification to assess ALLMs’ capacity for privacy extraction and the efficacy of existing safety mechanisms. Our findings reveal that ALLMs achieve 92.89% accuracy in gender identification and effectively model diverse sociodemographic attributes. However, they exhibit virtually no ability to refuse privacy-sensitive queries, and CoT reasoning substantially amplifies leakage risks, exposing critical deficiencies in current privacy safeguards.

Technology Category

Application Category

📝 Abstract

While Audio Large Language Models (ALLMs) have achieved remarkable progress in understanding and generation, their potential privacy implications remain largely unexplored. This paper takes the first step to investigate whether ALLMs inadvertently leak user privacy solely through acoustic voiceprints and introduces $\textit{HearSay}$, a comprehensive benchmark constructed from over 22,000 real-world audio clips. To ensure data quality, the benchmark is meticulously curated through a rigorous pipeline involving automated profiling and human verification, guaranteeing that all privacy labels are grounded in factual records. Extensive experiments on $\textit{HearSay}$ yield three critical findings: $\textbf{Significant Privacy Leakage}$: ALLMs inherently extract private attributes from voiceprints, reaching 92.89% accuracy on gender and effectively profiling social attributes. $\textbf{Insufficient Safety Mechanisms}$: Alarmingly, existing safeguards are severely inadequate; most models fail to refuse privacy-intruding requests, exhibiting near-zero refusal rates for physiological traits. $\textbf{Reasoning Amplifies Risk}$: Chain-of-Thought (CoT) reasoning exacerbates privacy risks in capable models by uncovering deeper acoustic correlations. These findings expose critical vulnerabilities in ALLMs, underscoring the urgent need for targeted privacy alignment. The codes and dataset are available at https://github.com/JinWang79/HearSay_Benchmark

Problem

Research questions and friction points this paper is trying to address.

Audio Large Language Models

privacy leakage

voiceprints

HearSay Benchmark

acoustic privacy

Innovation

Methods, ideas, or system contributions that make the work stand out.

Audio Large Language Models

Privacy Leakage

Voiceprint Analysis

HearSay Benchmark

Chain-of-Thought Reasoning

🔎 Similar Papers

No similar papers found.