Interpretable All-Type Audio Deepfake Detection with Audio LLMs via Frequency-Time Reinforcement Learning

📅 2026-01-06
🏛️ arXiv.org
📈 Citations: 2
Influential: 0
🤖 AI Summary
This work addresses the challenge of generalizing audio deepfake detection across diverse audio types—including speech, environmental sounds, singing, and music—where existing methods struggle to balance performance and interpretability. The authors propose a two-stage training framework based on Audio Large Language Models (ALLMs). First, they construct interpretable supervision signals via a frequency-time structured chain-of-thought (CoT) with automatic annotation. They then cold-start the model with supervised fine-tuning (SFT) on these rationales and apply a novel Frequency-Time Group Relative Policy Optimization (FT-GRPO) for reinforcement fine-tuning. The resulting model achieves state-of-the-art performance across all audio forgery detection tasks while generating human-interpretable reasoning grounded in frequency-time features, effectively mitigating reward hacking and hallucination issues.

📝 Abstract
Recent advances in audio large language models (ALLMs) have made high-quality synthetic audio widely accessible, increasing the risk of malicious audio deepfakes across speech, environmental sounds, singing voice, and music. Real-world audio deepfake detection (ADD) therefore requires all-type detectors that generalize across heterogeneous audio and provide interpretable decisions. Given the strong multi-task generalization ability of ALLMs, we first investigate their performance on all-type ADD under both supervised fine-tuning (SFT) and reinforcement fine-tuning (RFT). However, SFT using only binary real/fake labels tends to reduce the model to a black-box classifier, sacrificing interpretability. Meanwhile, vanilla RFT under sparse supervision is prone to reward hacking and can produce hallucinated, ungrounded rationales. To address this, we propose an automatic annotation and polishing pipeline that constructs Frequency-Time structured chain-of-thought (CoT) rationales, producing ~340K cold-start demonstrations. Building on this CoT data, we propose Frequency-Time Group Relative Policy Optimization (FT-GRPO), a two-stage training paradigm that cold-starts ALLMs with SFT and then applies GRPO under rule-based frequency-time constraints. Experiments demonstrate that FT-GRPO achieves state-of-the-art performance on all-type ADD while producing interpretable, FT-grounded rationales. The data and code are available online.
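The abstract describes GRPO applied "under rule-based frequency-time constraints". The paper's exact reward rules are not reproduced here, but the idea can be sketched as a reward function that scores each sampled response on (a) label correctness, (b) the presence of a structured rationale, and (c) frequency-time grounding of that rationale. All keyword lists, tag names, and score weights below are illustrative assumptions, not the authors' implementation:

```python
import re

# Hypothetical cue vocabularies for checking frequency-time grounding.
FREQ_TERMS = ("spectral", "frequency", "harmonic", "formant", "band")
TIME_TERMS = ("temporal", "duration", "onset", "rhythm", "transition")

def ft_grpo_reward(response: str, gold_label: str) -> float:
    """Rule-based reward for one sampled response (assumed shape:
    a <think>...</think> rationale followed by 'Answer: real|fake')."""
    reward = 0.0
    # (a) Correct final real/fake answer.
    m = re.search(r"answer:\s*(real|fake)", response, re.IGNORECASE)
    if m and m.group(1).lower() == gold_label:
        reward += 1.0
    # (b) Structured chain-of-thought present at all.
    think = re.search(r"<think>(.*?)</think>", response, re.DOTALL)
    if think:
        reward += 0.25
        rationale = think.group(1).lower()
        # (c) Frequency-time grounding: the rationale must cite both
        # frequency-domain and time-domain evidence, discouraging
        # ungrounded rationales (one source of reward hacking).
        if any(t in rationale for t in FREQ_TERMS):
            reward += 0.25
        if any(t in rationale for t in TIME_TERMS):
            reward += 0.25
    return reward
```

In GRPO proper, such per-response rewards would then be normalized within each group of sampled rollouts to form relative advantages; the sketch above covers only the rule-based scoring step.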
Problem

Research questions and friction points this paper is trying to address.

audio deepfake detection
all-type audio
interpretability
audio large language models
frequency-time analysis
Innovation

Methods, ideas, or system contributions that make the work stand out.

Audio LLMs
Frequency-Time CoT
FT-GRPO
All-Type Deepfake Detection
Interpretable AI
Yuankun Xie
PhD Candidate, Communication University of China
Audio Deepfake Detection · Domain Generalization · Out-of-Distribution Detection · Neural Audio Codec
Xiaoxuan Guo
Communication University of China, Beijing, China
Jiayi Zhou
Machine Intelligence, Ant Group, Shanghai, China
Tao Wang
Machine Intelligence, Ant Group, Shanghai, China
Jian Liu
Machine Intelligence, Ant Group, Shanghai, China
Ruibo Fu
Associate Professor, CASIA
AIGC · LMM · Intelligent Speech Interaction · Deepfake Detection
Xiaopeng Wang
Institute of Automation, Chinese Academy of Sciences
Fake Audio Detection · Text To Speech · Speech Large Model
Haonan Cheng
Communication University of China, Beijing, China
Long Ye
Communication University of China
Multimedia Signal Processing · Artificial Intelligence