From Talking to Singing: A New Challenge for Audio-Visual Deepfake Detection

📅 2026-05-27
🤖 AI Summary
This work addresses the sharp performance drop that existing audio-visual deepfake detectors suffer on singing content. Two factors drive the drop: rhythmic vocalization weakens the audio-visual coupling that detectors rely on, and no dedicated singing benchmark exists. To close this gap, the authors introduce SHDF (Singing Head DeepFake), a singing-head deepfake dataset built with rhythm-aware generative models, and propose T-AVFD, a text-guided audio-visual forgery detection framework. T-AVFD combines a facial authenticity pattern learner, which aligns facial features with multi-granularity textual descriptions, with a multi-modal differential weight learning module, which preserves audio-visual consistency and adaptively fuses it with the learned authenticity patterns. Together, these components target the domain shift between talking and singing. Experiments on multiple talking-head deepfake datasets and SHDF show consistent improvements over existing baselines and strong robustness under diverse perturbations.
📝 Abstract
With rapid advances in audio-visual generative models, reliable forgery detection becomes increasingly critical. Existing methods for audio-visual deepfake detection typically rely on cross-modal inconsistencies. In singing, rhythmic vocalization weakens this coupling and introduces a nontrivial domain shift, substantially degrading detection performance. We construct the Singing Head DeepFake (SHDF) dataset using rhythm-aware generative models to fill the gap in singing benchmarks. To cope with cross-scenario domain shifts, we propose a Text-guided Audio-Visual Forgery Detection (T-AVFD) framework that generalizes across both talking and singing scenarios. T-AVFD comprises a facial authenticity pattern learner and a multi-modal differential weight learning module. The pattern learner aligns facial features with multi-granularity textual descriptions to learn generalizable authenticity patterns. The weight learning module preserves intrinsic audio-visual consistency and adaptively integrates it with authenticity patterns via differential weighting. Extensive experiments on multiple talking head deepfake datasets and SHDF show consistent improvements over existing baselines and strong robustness under diverse perturbations.
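The abstract does not give formulas for the two modules, so the following is only a toy sketch of the general idea of fusing a text-aligned authenticity cue with an audio-visual consistency cue through an adaptive weight. Everything in it is an assumption made for illustration: the function names (`cosine`, `fuse_scores`), the `sharpness` parameter, the use of one "real" and one "fake" text embedding, and the rule that the weight depends on the difference in cue magnitudes. It is not the T-AVFD implementation.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def fuse_scores(face_emb, real_text_emb, fake_text_emb,
                audio_emb, visual_emb, sharpness=4.0):
    """Hypothetical fusion of two cues into one 'realness' score.

    Returns (realness, w), where w is the weight given to the
    text-aligned authenticity cue and (1 - w) goes to the
    audio-visual consistency cue.
    """
    # Cue 1 (assumed form): how much closer the face embedding is to a
    # "real" text description than to a "fake" one.
    authenticity = cosine(face_emb, real_text_emb) - cosine(face_emb, fake_text_emb)

    # Cue 2 (assumed form): agreement between audio and visual embeddings.
    consistency = cosine(audio_emb, visual_emb)

    # Assumed weighting rule: the cue with the larger magnitude gets
    # more weight; the difference of magnitudes drives a sigmoid gate.
    w = sigmoid(sharpness * (abs(authenticity) - abs(consistency)))

    realness = w * authenticity + (1.0 - w) * consistency
    return realness, w


if __name__ == "__main__":
    # Toy 2-D embeddings, chosen by hand purely for illustration.
    face, real_txt, fake_txt = [1.0, 0.0], [1.0, 0.0], [0.0, 1.0]
    # Weak audio-visual coupling: audio and visual embeddings are orthogonal.
    score, weight = fuse_scores(face, real_txt, fake_txt, [1.0, 0.0], [0.0, 1.0])
    print(score, weight)
```

In this toy setup, when the audio-visual cue is uninformative (as the abstract suggests can happen in singing), the gate shifts weight toward the text-aligned facial cue instead of relying on cross-modal agreement alone.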
Problem

Research questions and friction points this paper is trying to address.

audio-visual deepfake detection
singing
domain shift
cross-modal inconsistency
forgery detection
Innovation

Methods, ideas, or system contributions that make the work stand out.

audio-visual deepfake detection
singing deepfakes
text-guided learning
domain generalization
multimodal fusion
Authors
Ke Liu
Center for Future Media, School of Computer Science and Engineering, University of Electronic Science and Technology of China, Chengdu, China
Jiwei Wei
Professor at University of Electronic Science and Technology of China (UESTC)
Cross-Modal Retrieval, Metric Learning, Adversarial Machine Learning, AIGC
Wenyu Zhang
Center for Future Media, School of Computer Science and Engineering, University of Electronic Science and Technology of China, Chengdu, China
Shuchang Zhou
Megvii Inc.
Artificial Intelligence
Ruikun Chai
Center for Future Media, School of Computer Science and Engineering, University of Electronic Science and Technology of China, Chengdu, China
Yutao Dai
Center for Future Media, School of Computer Science and Engineering, University of Electronic Science and Technology of China, Chengdu, China
Chaoning Zhang
Professor at UESTC (University of Electronic Science and Technology of China)
Computer Vision, LLM and VLM, GenAI and AIGC Detection
Yang Yang
Center for Future Media, School of Computer Science and Engineering, University of Electronic Science and Technology of China, Chengdu, China