🤖 AI Summary
This work addresses the limitations of existing deepfake speech detection methods, which typically rely on black-box binary classification and struggle to integrate structured acoustic evidence (such as prosody, spectral characteristics, and physiological cues) for interpretable reasoning. To overcome these limitations, the authors propose CoLMbo-DF, a framework that introduces an acoustic chain-of-thought mechanism into deepfake detection for the first time. By converting low-level acoustic features into structured textual prompts, the framework guides a lightweight, open-source audio language model through a step-by-step reasoning process. Evaluated on a newly curated dataset annotated with chain-of-thought rationales, CoLMbo-DF significantly outperforms current baselines, achieving both high detection accuracy and strong interpretability despite its relatively small model size.
📝 Abstract
Deepfake speech detection systems are often limited to binary classification and struggle to generate interpretable reasoning or context-rich explanations for their decisions. These models primarily extract latent embeddings for authenticity detection but fail to make meaningful use of structured acoustic evidence such as prosodic, spectral, and physiological attributes. This paper introduces CoLMbo-DF, a Feature-Guided Audio Language Model that addresses these limitations by integrating robust deepfake detection with explicit acoustic chain-of-thought reasoning. By injecting structured textual representations of low-level acoustic features directly into the model prompt, our approach grounds the model's reasoning in interpretable evidence and improves detection accuracy. To support this framework, we introduce a novel dataset of audio pairs annotated with chain-of-thought rationales. Experiments show that our method, trained on a lightweight open-source language model, significantly outperforms existing audio language model baselines despite its smaller scale, advancing explainable deepfake speech detection.
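
To illustrate the idea of injecting structured textual representations of acoustic features into a prompt, the following is a minimal sketch and not the authors' implementation. The feature names, their grouping, the number formatting, and the instruction wording are all hypothetical; the abstract names only the three categories (prosodic, spectral, physiological).

```python
from typing import Mapping

# Hypothetical grouping: only the three category names come from the abstract;
# the individual feature names below are illustrative assumptions.
FEATURE_GROUPS = {
    "prosodic": ("f0_mean_hz", "speaking_rate_syll_per_s"),
    "spectral": ("spectral_centroid_hz", "spectral_flatness"),
    "physiological": ("jitter_percent", "shimmer_percent"),
}


def build_feature_prompt(features: Mapping[str, float]) -> str:
    """Render precomputed acoustic features as a structured text prompt."""
    lines = ["Acoustic evidence:"]
    for group, names in FEATURE_GROUPS.items():
        # Keep only the features that were actually supplied for this group.
        present = [(name, features[name]) for name in names if name in features]
        if not present:
            continue
        rendered = ", ".join(f"{name}={value:.2f}" for name, value in present)
        lines.append(f"- {group}: {rendered}")
    # Hypothetical instruction asking for step-by-step reasoning before the verdict.
    lines.append(
        "Reason step by step over the evidence above, then answer 'real' or 'fake'."
    )
    return "\n".join(lines)


# Example with made-up values, for illustration only.
example_prompt = build_feature_prompt(
    {
        "f0_mean_hz": 182.4,
        "spectral_centroid_hz": 2150.0,
        "spectral_flatness": 0.031,
        "jitter_percent": 1.2,
    }
)
print(example_prompt)
```

In a setup like this, the rendered text would be passed to the audio language model alongside the audio input, so that the generated rationale can cite named, inspectable evidence instead of relying only on latent embeddings.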