When Eyes Betray AI: Social Gaze Consistency as a Semantic Cue for AI-Generated Image Detection

📅 2026-05-26
🤖 AI Summary
Existing AI-generated image detectors struggle to identify high-level semantic inconsistencies in human interaction scenes after localized editing, particularly when low-level visual artifacts have been removed. This work introduces “social gaze consistency” as a novel semantic cue, modeling the coordination among gaze direction, head–eye alignment, and pupil positions of interacting individuals to establish a higher-order discriminative mechanism. To support this approach, the authors construct a paired perturbation dataset that blocks generator-fingerprint memorization and propose a block-wise compositional caption supervision strategy that decouples reasoning logic from superficial diversity, with gains that carry over across architectures. On the COCOAI Interaction and Person subsets, the method improves balanced accuracy by 3.7 percentage points (67.8→71.5) and 1.3 percentage points (83.0→84.3), respectively, with simultaneous gains in real- and fake-class recall, indicating that the improvement does not come from biasing predictions toward one class.
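The summary reports balanced accuracy together with per-class recall. As a minimal sketch of why both matter, the snippet below computes balanced accuracy as the mean of real-class and fake-class recall. The function name `balanced_accuracy`, the label convention (0 = real, 1 = fake), and the toy labels are assumptions made for illustration; they are not taken from the paper or its code.

```python
def balanced_accuracy(y_true, y_pred):
    """Return (balanced accuracy, real-class recall, fake-class recall).

    Assumed label convention for this sketch: 0 = real, 1 = fake.
    """
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)

    # Guard against a class being absent from y_true.
    fake_recall = tp / (tp + fn) if (tp + fn) else 0.0
    real_recall = tn / (tn + fp) if (tn + fp) else 0.0
    return (real_recall + fake_recall) / 2, real_recall, fake_recall


# Toy imbalanced set: 2 real images, 8 fake images (illustrative only).
y_true = [0, 0] + [1] * 8

# A degenerate detector that labels everything as fake.
all_fake = [1] * 10

plain_acc = sum(t == p for t, p in zip(y_true, all_fake)) / len(y_true)
bal_acc, real_rec, fake_rec = balanced_accuracy(y_true, all_fake)
```

In this toy case plain accuracy is 0.8 while balanced accuracy is 0.5, with real-class recall at 0. A detector whose real- and fake-class recalls both rise cannot be explained by this kind of one-sided prediction.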
📝 Abstract
Recent generative models have largely closed the gap on low-level artifacts (pixel fingerprints, frequency anomalies, upsampling traces), particularly in person-centric and partial-edit settings where the manipulated region is small and surrounded by photometrically authentic content. We introduce Social Gaze Consistency, a high-level semantic cue defined as the mutual coherence of gaze direction, head-eye alignment, and pupil placement between interacting individuals, and show that it constitutes a previously underutilized detection axis orthogonal to existing low-level paradigms. We instantiate this insight through three coupled mechanisms: (i) a controlled diagnostic dataset with region-specific perturbations of gaze-consistent imagery, where strict pair-level grouping forecloses generator-fingerprint memorization as an optimization-time shortcut rather than relying on augmentation; (ii) Block-Compositional Caption Supervision, which holds a single 5-block reasoning skeleton invariant across 1,250 macro-combined captions, decoupling reasoning consistency from surface diversity; (iii) cross-architecture validation showing that the same supervision improves a vision-language backbone (FakeVLM) by +3.7 pp on the COCOAI Interaction subset (balanced accuracy 67.8 → 71.5) and +1.3 pp on the COCOAI Person subset (83.0 → 84.3), with consistent gains on a vision-only backbone (Effort), evidencing a backbone-agnostic cue. Real- and fake-class recalls rise simultaneously, ruling out a "predict-all-fake" artifact. A four-step mechanistic account (paired-edit shortcut blocking, hard-to-easy difficulty transfer, CLIP prior preservation, and diffusion-family shared spectral weakness in periocular structure) explains why training on a single inpainter (FLUX.1-Fill) transfers to multi-generator suites. We will release the code upon acceptance to facilitate reproducibility.
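The abstract does not spell out how "strict pair-level grouping" is implemented. One plausible reading, sketched below, is that a real image and its perturbed counterpart share a pair identifier and are always assigned to the same data split, so a model cannot separate them by memorizing image-specific content. The record layout (`pair_id`, `label`), the function `split_by_pair`, and the split fraction are hypothetical choices for this illustration, not the authors' procedure.

```python
import random


def split_by_pair(samples, test_fraction=0.2, seed=0):
    """Split samples so that both members of a pair land in the same split.

    Each sample is a dict with a hypothetical "pair_id" key shared by an
    original image and its perturbed counterpart.
    """
    pair_ids = sorted({s["pair_id"] for s in samples})
    rng = random.Random(seed)  # seeded for a reproducible split
    rng.shuffle(pair_ids)

    n_test = max(1, round(len(pair_ids) * test_fraction))
    test_ids = set(pair_ids[:n_test])

    train = [s for s in samples if s["pair_id"] not in test_ids]
    test = [s for s in samples if s["pair_id"] in test_ids]
    return train, test


# Toy dataset: 10 pairs, each with one real and one perturbed image.
samples = []
for pid in range(10):
    samples.append({"pair_id": pid, "label": "real"})
    samples.append({"pair_id": pid, "label": "fake"})

train, test = split_by_pair(samples, test_fraction=0.2, seed=0)
train_ids = {s["pair_id"] for s in train}
test_ids = {s["pair_id"] for s in test}
```

Because the split is made over pair identifiers rather than individual images, `train_ids` and `test_ids` never overlap, and every pair in either split contains both its real and its perturbed member.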
Problem

Research questions and friction points this paper is trying to address.

AI-generated image detection
social gaze consistency
semantic cue
person-centric manipulation
high-level artifact
Innovation

Methods, ideas, or system contributions that make the work stand out.

Social Gaze Consistency
AI-generated image detection
Block-Compositional Caption Supervision
high-level semantic cue
cross-architecture validation