Towards Explicit Acoustic Evidence Perception in Audio LLMs for Speech Deepfake Detection

📅 2026-01-30
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses a limitation of existing audio large language models (LLMs) in speech deepfake detection: they rely heavily on semantic cues and struggle to identify synthetically generated utterances that are semantically coherent yet contain subtle acoustic anomalies. To overcome this, we propose SDD-APALLM, a framework that explicitly models fine-grained time-frequency acoustic evidence within an audio LLM architecture. By jointly integrating raw waveforms and structured spectrograms, the approach builds an auditory-perception-enhanced model in which semantic understanding and acoustic artifact perception coordinate synergistically, rather than being simply fused. Extensive experiments show that SDD-APALLM achieves significant improvements in detection accuracy and robustness across multiple benchmarks, excelling particularly in scenarios where synthetic speech exhibits deceptive semantic plausibility, thereby validating the proposed acoustic-semantic collaborative mechanism.

📝 Abstract
Speech deepfake detection (SDD) focuses on identifying whether a given speech signal is genuine or has been synthetically generated. Existing audio large language model (LLM)-based methods excel at content understanding; however, their predictions are often biased toward semantically correlated cues, so fine-grained acoustic artifacts are overlooked during the decision-making process. Consequently, fake speech with natural semantics can bypass detectors despite harboring subtle acoustic anomalies. This suggests that the challenge stems not from the absence of acoustic data, but from its inadequate accessibility when semantic-dominant reasoning prevails. To address this issue, we investigate SDD within the audio LLM paradigm and introduce SDD with Auditory Perception-enhanced Audio Large Language Model (SDD-APALLM), an acoustically enhanced framework designed to explicitly expose fine-grained time-frequency evidence as accessible acoustic cues. By combining raw audio with structured spectrograms, the framework empowers audio LLMs to capture subtle acoustic inconsistencies more effectively without compromising their semantic understanding. Experimental results indicate consistent gains in detection accuracy and robustness, especially in cases where semantic cues are misleading. Further analysis reveals that these improvements stem from a coordinated utilization of semantic and acoustic information, as opposed to simple modality aggregation.
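The abstract's central input design, pairing the raw waveform with a structured spectrogram so that time-frequency evidence is explicitly available alongside the audio stream, can be sketched minimally. The paper does not publish code, so every name below (`log_spectrogram`, the FFT and hop sizes, the `inputs` dict) is an illustrative assumption, not the authors' implementation; a NumPy STFT magnitude with log compression stands in for whatever spectrogram front-end SDD-APALLM actually uses.

```python
import numpy as np

def log_spectrogram(wave, n_fft=512, hop=128):
    """Log-magnitude STFT: one plausible 'structured spectrogram' view.

    Log compression boosts low-energy regions, where subtle synthesis
    artifacts (the acoustic evidence the paper targets) tend to live.
    """
    window = np.hanning(n_fft)
    n_frames = 1 + (len(wave) - n_fft) // hop
    frames = np.stack(
        [wave[i * hop : i * hop + n_fft] * window for i in range(n_frames)]
    )
    mag = np.abs(np.fft.rfft(frames, axis=1))  # (frames, n_fft//2 + 1)
    return np.log1p(mag)

# Toy input: 1 second of 16 kHz audio (a synthetic sine, for illustration).
sr = 16000
wave = np.sin(2 * np.pi * 440 * np.arange(sr) / sr).astype(np.float32)

spec = log_spectrogram(wave)
# Per the abstract, both views would be handed to the audio LLM jointly,
# rather than fusing them into a single representation beforehand.
inputs = {"waveform": wave, "spectrogram": spec}
```

With these sizes, one second of 16 kHz audio yields 122 frames of 257 frequency bins; the point of the sketch is only that the model receives two complementary views of the same signal, not any particular parameterization.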
Problem

Research questions and friction points this paper is trying to address.

Speech Deepfake Detection
Audio LLMs
Acoustic Evidence
Semantic Bias
Time-Frequency Artifacts
Innovation

Methods, ideas, or system contributions that make the work stand out.

audio LLM
speech deepfake detection
acoustic evidence perception
spectrogram integration
semantic-acoustic coordination