From Talking to Singing: A New Challenge for Audio-Visual Deepfake Detection

📅 2026-05-27

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the sharp drop in accuracy that existing audio-visual deepfake detectors suffer on singing videos. Two factors drive the drop: rhythmic vocalization weakens the audio-visual coupling these detectors rely on, and no dedicated singing benchmark has been available. To close the gap, the authors build SHDF (Singing Head DeepFake), presented as the first singing-head deepfake dataset and generated with rhythm-aware generative models, and propose T-AVFD, a text-guided audio-visual forgery detection framework. T-AVFD combines a facial authenticity pattern learner, which aligns facial features with multi-granularity textual descriptions, with a multi-modal differential weight learning module, which preserves intrinsic audio-visual consistency and adaptively fuses it with the learned authenticity patterns. Together, these components are intended to mitigate the domain shift between talking and singing. Experiments on multiple talking-head deepfake datasets and on SHDF report consistent improvements over existing baselines and strong robustness under diverse perturbations.

📝 Abstract

With rapid advances in audio-visual generative models, reliable forgery detection becomes increasingly critical. Existing methods for audio-visual deepfake detection typically rely on cross-modal inconsistencies. In singing, rhythmic vocalization weakens this coupling and introduces a nontrivial domain shift, substantially degrading detection performance. We construct the Singing Head DeepFake (SHDF) dataset using rhythm-aware generative models to fill the gap in singing benchmarks. To cope with cross-scenario domain shifts, we propose a Text-guided Audio-Visual Forgery Detection (T-AVFD) framework that generalizes across both talking and singing scenarios. T-AVFD comprises a facial authenticity pattern learner and a multi-modal differential weight learning module. The pattern learner aligns facial features with multi-granularity textual descriptions to learn generalizable authenticity patterns. The weight learning module preserves intrinsic audio-visual consistency and adaptively integrates it with authenticity patterns via differential weighting. Extensive experiments on multiple talking head deepfake datasets and SHDF show consistent improvements over existing baselines and strong robustness under diverse perturbations.
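
The abstract does not specify how the differential weighting is computed, so the following is only a minimal sketch of one way an audio-visual consistency cue could be adaptively combined with a facial authenticity score. Every name here (`cosine`, `fuse_scores`, `sharpness`) is hypothetical, and the rule that disagreement between the two cues shifts weight toward the authenticity score is an assumption made for illustration, not the T-AVFD formulation.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def fuse_scores(audio_emb, visual_emb, authenticity_score, sharpness=4.0):
    """Blend a cross-modal consistency cue with a facial authenticity score.

    Assumption (not from the paper): when the two cues disagree, as they may
    in singing where audio-visual coupling is weak, more weight goes to the
    authenticity score. `authenticity_score` is expected in [0, 1].
    """
    # Map cosine similarity from [-1, 1] to [0, 1] so both cues share a scale.
    consistency = (cosine(audio_emb, visual_emb) + 1.0) / 2.0
    # The gap between the two cues drives the adaptive weight.
    discrepancy = abs(authenticity_score - consistency)
    weight = sigmoid(sharpness * discrepancy)  # lies in [0.5, 1)
    return weight * authenticity_score + (1.0 - weight) * consistency


# Hypothetical usage: toy 2-D embeddings standing in for encoder outputs.
agree = fuse_scores([1.0, 0.0], [1.0, 0.0], authenticity_score=1.0)
disagree = fuse_scores([1.0, 0.0], [-1.0, 0.0], authenticity_score=1.0)
```

In this sketch the fused score always lies between the two input cues, and the `sharpness` parameter controls how quickly weight moves toward the authenticity score as the cues diverge. A learned module would replace the fixed sigmoid gate with trainable parameters.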

Problem

Research questions and friction points this paper is trying to address.

audio-visual deepfake detection

singing

domain shift

cross-modal inconsistency

forgery detection

Innovation

Methods, ideas, or system contributions that make the work stand out.

audio-visual deepfake detection

singing deepfakes

text-guided learning