🤖 AI Summary
Existing deepfake detection methods suffer significant performance degradation on multi-face videos, primarily due to their inability to model critical contextual cues inherent in social scenes. Inspired by human visual cognition, this work conducts systematic psychological experiments that identify four discriminative cues: scene motion coherence, facial appearance compatibility, inter-personal gaze alignment, and face-body consistency. Based on these findings, we propose an interpretable and generalizable multi-face deepfake detection framework that jointly models multimodal features and leverages large language models (LLMs) to generate human-readable decision rationales. Evaluated on standard benchmarks, our method achieves a 3.3% improvement in in-distribution accuracy and a 2.8% gain under realistic perturbations, and it outperforms state-of-the-art approaches by 5.8% in cross-dataset generalization, demonstrating substantially enhanced robustness and decision transparency.
📝 Abstract
Multi-face deepfake videos are becoming increasingly prevalent, often appearing in natural social settings that challenge existing detection methods. Most current approaches excel at single-face detection but struggle in multi-face scenarios due to a lack of awareness of crucial contextual cues. In this work, we develop a novel approach that leverages human cognition to analyze and defend against multi-face deepfake videos. Through a series of human studies, we systematically examine how people detect deepfake faces in social settings. Our quantitative analysis reveals four key cues humans rely on: scene-motion coherence, inter-face appearance compatibility, interpersonal gaze alignment, and face-body consistency. Guided by these insights, we introduce HICOM, a novel framework designed to detect every fake face in multi-face scenarios. Extensive experiments on benchmark datasets show that HICOM improves average accuracy by 3.3% in in-dataset detection and by 2.8% under real-world perturbations. Moreover, it outperforms existing methods by 5.8% on unseen datasets, demonstrating the generalization of human-inspired cues. HICOM further enhances interpretability by incorporating an LLM to provide human-readable explanations, making detection results more transparent and convincing. Our work sheds light on how human factors can strengthen defenses against deepfakes.