🤖 AI Summary
This work addresses the limited generalization of existing deepfake detectors when confronted with images from diverse generative models. The authors propose a robust detection framework grounded in low-level physical features, systematically identifying five universal physical descriptors, such as Laplacian variance and Sobel statistics, that remain consistent across datasets and generative architectures. These descriptors are selected via a feature-selection algorithm, encoded as textual signals, and integrated into the CLIP multimodal model, fusing pixel-level physical cues with high-level semantic understanding. The resulting approach achieves state-of-the-art performance on multiple subsets of the GenImage benchmark, attaining 99.8% accuracy on datasets including Wukong and Stable Diffusion v1.4, and thereby significantly improving the robustness of AI-generated image detection.
📝 Abstract
The rapid advancement of AI-generated content (AIGC) has blurred the boundaries between real and synthetic images, exposing the limitations of existing deepfake detectors that often overfit to specific generative models. This adaptability crisis calls for a fundamental reexamination of the intrinsic physical characteristics that distinguish natural from AI-generated images. In this paper, we address two critical research questions: (1) What physical features can stably and robustly discriminate AI-generated images across diverse datasets and generative architectures? (2) Can these objective pixel-level features be integrated into multimodal models like CLIP to enhance detection performance while mitigating the unreliability of language-based information? To answer these questions, we conduct a comprehensive exploration of 15 physical features across more than 20 datasets generated by various GANs and diffusion models. We propose a novel feature selection algorithm that identifies five core physical features, including Laplacian variance, Sobel statistics, and residual noise variance, that exhibit consistent discriminative power across all tested datasets. These features are then converted into text-encoded values and integrated with semantic captions to guide image-text representation learning in CLIP. Extensive experiments demonstrate that our method achieves state-of-the-art performance on multiple GenImage benchmarks, with near-perfect accuracy (99.8%) on datasets such as Wukong and SDv1.4. By bridging pixel-level authenticity with semantic understanding, this work pioneers the use of physically grounded features for trustworthy vision-language modeling and opens new directions for mitigating hallucinations and textual inaccuracies in large multimodal models.
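The paper's exact descriptor definitions are not reproduced here. As an illustrative sketch only, the NumPy snippet below (function names and kernel choices are assumptions, not the authors' implementation) shows one common way to compute three of the named cues on a grayscale image: Laplacian variance, Sobel gradient-magnitude statistics, and residual noise variance.

```python
import numpy as np

def _xcorr2d(img: np.ndarray, k: np.ndarray) -> np.ndarray:
    """'Valid'-mode 2-D cross-correlation via sliding windows (no SciPy needed)."""
    windows = np.lib.stride_tricks.sliding_window_view(img, k.shape)
    return np.einsum("ijkl,kl->ij", windows, k)

def laplacian_variance(img: np.ndarray) -> float:
    """Variance of the 3x3 Laplacian response: a common sharpness/blur cue."""
    lap = np.array([[0, 1, 0], [1, -4, 1], [0, 1, 0]], dtype=float)
    return float(_xcorr2d(img.astype(float), lap).var())

def sobel_stats(img: np.ndarray) -> tuple:
    """Mean and std of the Sobel gradient magnitude: an edge-energy cue."""
    kx = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
    img = img.astype(float)
    mag = np.hypot(_xcorr2d(img, kx), _xcorr2d(img, kx.T))
    return float(mag.mean()), float(mag.std())

def residual_noise_variance(img: np.ndarray) -> float:
    """Variance of the high-frequency residual after a 3x3 mean blur,
    a simple stand-in for a denoising-based noise estimate."""
    img = img.astype(float)
    blurred = _xcorr2d(img, np.full((3, 3), 1.0 / 9.0))
    return float((img[1:-1, 1:-1] - blurred).var())
```

Such scalar descriptors could then be formatted into text (e.g. "laplacian variance: 12.3") and appended to a caption before CLIP's text encoder, though the paper's exact encoding scheme may differ.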