🤖 AI Summary
This study investigates the zero-shot detection capability of four state-of-the-art vision-language models (VLMs): ChatGPT, Claude, Gemini, and Grok, on three deepfake image categories: face swapping, facial expression reenactment, and synthetic generation. Method: We construct a multi-source deepfake benchmark and propose a dual-dimensional evaluation framework assessing both classification accuracy and reasoning depth. Contribution/Results: All four VLMs fall well short of the accuracy of specialized deepfake detectors, confirming their unsuitability as standalone forensic tools, and we identify previously unreported failure modes, including stylistic bias and vintage-image interference. Conversely, VLMs demonstrate strong contextual understanding and natural-language explanation capabilities. Accordingly, we advocate repositioning VLMs as interpretable, human-in-the-loop forensic components rather than autonomous detectors. Our empirical findings and methodological framework provide foundational evidence and guidance for a paradigm shift toward explainable, collaborative digital forensics with VLMs.
📝 Abstract
The growing sophistication of deepfakes presents substantial challenges to the integrity of media and the preservation of public trust. Concurrently, vision-language models (VLMs), large language models enhanced with visual reasoning capabilities, have emerged as promising tools across various domains, sparking interest in their applicability to deepfake detection. This study conducts a structured zero-shot evaluation of four prominent VLMs (ChatGPT, Claude, Gemini, and Grok), focusing on three primary deepfake types: face swapping, reenactment, and synthetic generation. Leveraging a meticulously assembled benchmark comprising authentic and manipulated images from diverse sources, we evaluate each model's classification accuracy and reasoning depth. Our analysis indicates that while VLMs can produce coherent explanations and detect surface-level anomalies, they are not yet dependable as standalone detection systems. We highlight critical failure modes, such as an overemphasis on stylistic elements and vulnerability to misleading visual patterns like vintage aesthetics. Nevertheless, VLMs exhibit strengths in interpretability and contextual analysis, suggesting their potential to augment human expertise in forensic workflows. These insights imply that although general-purpose models currently lack the reliability needed for autonomous deepfake detection, they hold promise as integral components in hybrid or human-in-the-loop detection frameworks.
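The zero-shot evaluation protocol described above can be sketched as follows. This is a minimal illustration, not the paper's released code: the `query_vlm` function, its signature, and the prompt wording are hypothetical stand-ins for whichever VLM API (ChatGPT, Claude, Gemini, or Grok) is under evaluation.

```python
# Hypothetical sketch of a zero-shot deepfake-classification loop.
# `query_vlm` is a placeholder for a real VLM API call; the prompt text
# is illustrative, not the study's actual prompt.

PROMPT = ("Is this image an authentic photograph or a manipulated/synthetic "
          "deepfake? Answer 'real' or 'fake', then explain your reasoning.")

def query_vlm(image_path: str, prompt: str) -> str:
    """Placeholder for a real VLM API call returning the model's text answer."""
    raise NotImplementedError

def evaluate_zero_shot(samples, model_query=query_vlm):
    """Score classification accuracy over (image_path, label) pairs,
    where label is "real" or "fake". Reasoning depth, the second axis of
    the dual-dimensional framework, would be graded separately on the
    free-text explanation each model returns."""
    correct = 0
    for image_path, label in samples:
        answer = model_query(image_path, PROMPT).lower()
        # Parse the verdict from the opening of the free-text answer.
        predicted = "fake" if "fake" in answer[:40] else "real"
        correct += int(predicted == label)
    return correct / len(samples)
```

In practice, the verdict-parsing step is itself a source of noise, since VLMs often hedge or refuse; a robust benchmark would log raw answers and score the explanation text separately.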