🤖 AI Summary
Existing deepfake detection methods are hindered by modality fragmentation and shallow cross-modal semantic reasoning, limiting their effectiveness against diverse and adversarial multimodal forgeries. This work proposes ConLLM, a two-stage framework: pretrained models first extract modality-specific embeddings, which are then aligned via contrastive learning and refined through fine-grained semantic-inconsistency reasoning with a large language model (LLM). According to the authors, this is the first approach to integrate LLMs with contrastive learning for multimodal deepfake detection, substantially mitigating modality fragmentation and deepening cross-modal semantic understanding. Experimental results show up to a 50% reduction in Equal Error Rate (EER) on audio deepfake tasks, up to 8% higher video detection accuracy, and roughly 9% higher accuracy in audio-visual joint detection.
📝 Abstract
The rapid rise of deepfake technology poses a severe threat to social and political stability by enabling hyper-realistic synthetic media capable of manipulating public perception. Existing detection methods struggle with two core limitations: (1) modality fragmentation, which leads to poor generalization across diverse and adversarial deepfake modalities; and (2) shallow inter-modal reasoning, which limits detection of fine-grained semantic inconsistencies. To address these limitations, we propose ConLLM (Contrastive Learning with Large Language Models), a hybrid framework for robust multimodal deepfake detection. ConLLM employs a two-stage architecture: Stage 1 uses Pre-Trained Models (PTMs) to extract modality-specific embeddings; Stage 2 aligns these embeddings via contrastive learning to mitigate modality fragmentation, then refines them with LLM-based reasoning that captures semantic inconsistencies, addressing shallow inter-modal reasoning. ConLLM demonstrates strong performance across audio, video, and audio-visual modalities: it reduces audio deepfake EER by up to 50%, improves video accuracy by up to 8%, and achieves approximately 9% accuracy gains on audio-visual tasks. Ablation studies confirm that PTM-based embeddings contribute consistent 9%-10% improvements across modalities.
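The abstract does not specify the exact alignment objective, but Stage 2's contrastive alignment of modality-specific embeddings is commonly implemented as a symmetric InfoNCE loss over paired samples. The sketch below is a minimal, hypothetical illustration of that idea (the function name, temperature value, and use of NumPy are assumptions, not details from the paper): paired audio/video embeddings are pulled together while mismatched pairs are pushed apart.

```python
import numpy as np

def contrastive_alignment_loss(audio_emb, video_emb, temperature=0.07):
    """Symmetric InfoNCE loss over paired audio/video embeddings.

    Hypothetical stand-in for ConLLM's Stage-2 alignment objective:
    the i-th audio embedding is treated as the positive match for the
    i-th video embedding; all other pairings are negatives.
    """
    # L2-normalize so the dot product is cosine similarity.
    a = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)
    v = video_emb / np.linalg.norm(video_emb, axis=1, keepdims=True)
    logits = a @ v.T / temperature          # (N, N) similarity matrix
    diag = np.arange(len(a))                # matching-pair indices

    def cross_entropy(l):
        # Log-softmax with max subtraction for numerical stability.
        l = l - l.max(axis=1, keepdims=True)
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[diag, diag].mean()

    # Average the audio->video and video->audio directions.
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))

rng = np.random.default_rng(0)
paired = rng.normal(size=(8, 16))
# Nearly identical pairs: loss should be low.
loss_aligned = contrastive_alignment_loss(
    paired, paired + 0.01 * rng.normal(size=(8, 16)))
# Unrelated pairs: loss should be substantially higher.
loss_random = contrastive_alignment_loss(paired, rng.normal(size=(8, 16)))
```

Minimizing such a loss pushes embeddings of genuinely co-occurring audio and video toward a shared space, which is what lets the downstream LLM reasoning stage focus on residual semantic mismatches rather than raw modality gaps.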