Audio Language Model for Deepfake Detection Grounded in Acoustic Chain-of-Thought

📅 2026-03-30

📈 Citations: 0

✨ Influential: 0


🤖 AI Summary

Existing deepfake speech detectors typically act as black-box binary classifiers and struggle to use structured acoustic evidence (prosody, spectral characteristics, and physiological cues) for interpretable reasoning. To address this, the authors propose CoLMbo-DF, a framework that introduces an acoustic chain-of-thought mechanism into deepfake detection for the first time. Low-level acoustic features are converted into structured textual prompts, which guide a lightweight, open-source audio language model through step-by-step reasoning. Evaluated on a newly curated dataset annotated with chain-of-thought rationales, CoLMbo-DF significantly outperforms existing audio language model baselines, combining high detection accuracy with strong interpretability despite its relatively small model size.


📝 Abstract

Deepfake speech detection systems are often limited to binary classification and struggle to generate interpretable reasoning or context-rich explanations for their decisions. These models primarily extract latent embeddings for authenticity detection but fail to make meaningful use of structured acoustic evidence such as prosodic, spectral, and physiological attributes. This paper introduces CoLMbo-DF, a Feature-Guided Audio Language Model that addresses these limitations by integrating robust deepfake detection with explicit acoustic chain-of-thought reasoning. By injecting structured textual representations of low-level acoustic features directly into the model prompt, our approach grounds the model's reasoning in interpretable evidence and improves detection accuracy. To support this framework, we introduce a novel dataset of audio pairs accompanied by chain-of-thought annotations. Experiments show that our method, trained on a lightweight open-source language model, significantly outperforms existing audio language model baselines despite its smaller scale, marking an advance in explainable deepfake speech detection.
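The idea of injecting structured textual representations of acoustic features into the model prompt can be illustrated with a minimal sketch. This is not the authors' implementation: the feature names, units, bracketed group labels, and instruction wording below are all assumptions chosen for illustration, and the abstract does not specify the actual feature set or prompt template.

```python
from dataclasses import dataclass


@dataclass
class AcousticFeatures:
    """Hypothetical low-level measurements; names and units are illustrative only."""
    f0_mean_hz: float            # prosodic: mean fundamental frequency
    f0_std_hz: float             # prosodic: pitch variability
    spectral_centroid_hz: float  # spectral: centre of mass of the spectrum
    jitter_percent: float        # physiological: cycle-to-cycle pitch perturbation


def build_prompt(feats: AcousticFeatures) -> str:
    """Render the features as structured text and request step-by-step reasoning."""
    evidence = [
        "[Prosodic] mean F0 = {:.1f} Hz, F0 std = {:.1f} Hz".format(
            feats.f0_mean_hz, feats.f0_std_hz
        ),
        "[Spectral] spectral centroid = {:.1f} Hz".format(
            feats.spectral_centroid_hz
        ),
        "[Physiological] jitter = {:.2f} %".format(feats.jitter_percent),
    ]
    # Assumed instruction text; the real prompt template is not given in the source.
    instructions = (
        "Reason step by step over the prosodic, spectral, and physiological "
        "evidence, then answer with 'real' or 'fake'."
    )
    return "Acoustic evidence:\n" + "\n".join(evidence) + "\n\n" + instructions


# Example with made-up values for a single utterance.
example = AcousticFeatures(
    f0_mean_hz=182.4,
    f0_std_hz=21.7,
    spectral_centroid_hz=2350.0,
    jitter_percent=0.84,
)
print(build_prompt(example))
```

In a setup like this, the resulting string would be passed to the audio language model alongside the audio input, so that each reasoning step can refer to a named, human-readable measurement rather than an opaque embedding.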

Problem

Research questions and friction points this paper is trying to address.

deepfake detection

audio language model

acoustic features

explainable AI

chain-of-thought reasoning

Innovation

Methods, ideas, or system contributions that make the work stand out.

Audio Language Model

Deepfake Detection

Acoustic Chain-of-Thought