Non-identifiability of Explanations from Model Behavior in Deep Networks of Image Authenticity Judgments

📅 2026-04-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study investigates whether attribution explanations from deep neural networks align with human cognitive mechanisms in judging image realism. The authors fit lightweight regression heads on multiple frozen pretrained vision models, including EfficientNetB3 and Barlow Twins, and generate attribution maps using Grad-CAM, LIME, and multi-scale pixel masking to systematically evaluate explanation consistency and robustness across architectures. Despite comparable predictive performance in estimating human realism ratings (about 80% of the noise ceiling), the models diverge substantially in their attributions; VGG, notably, relies predominantly on image-quality cues rather than semantic authenticity. To address this, the work proposes an ensemble strategy that improves prediction of human authenticity judgments and yields more reliable, interpretable image-level attributions via pixel masking.
📝 Abstract
Deep neural networks can predict human judgments, but this does not imply that they rely on human-like information or reveal the cues underlying those judgments. Prior work has addressed this issue using attribution heatmaps, but their explanatory value in itself depends on robustness. Here we tested the robustness of such explanations by evaluating whether models that predict human authenticity ratings also produce consistent explanations within and across architectures. We fit lightweight regression heads to multiple frozen pretrained vision models and generated attribution maps using Grad-CAM, LIME, and multiscale pixel masking. Several architectures predicted ratings well, reaching about 80% of the noise ceiling. VGG models achieved this by tracking image quality rather than authenticity-specific variance, limiting the relevance of their attributions. Among the remaining models, attribution maps were generally stable across random seeds within an architecture, especially for EfficientNetB3 and Barlow Twins, and consistency was higher for images judged as more authentic. Crucially, agreement in attribution across architectures was weak even when predictive performance was similar. To address this, we combined models in ensembles, which improved prediction of human authenticity judgments and enabled image-level attribution via pixel masking. We conclude that while deep networks can predict human authenticity judgments well, they do not produce identifiable explanations for those judgments. More broadly, our findings suggest that post hoc explanations from successful models of behavior should be treated as weak evidence for cognitive mechanism.
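The pixel-masking attribution described in the abstract can be illustrated with a minimal occlusion sketch: slide a baseline-valued patch over the image at several scales and record how much the model's predicted realism rating drops when each region is hidden. This is an illustrative reconstruction, not the authors' implementation; the `predict` function, patch sizes, and baseline value are all assumptions for the example.

```python
import numpy as np

def occlusion_attribution(predict, image, patch_sizes=(8, 16), baseline=0.0):
    """Multi-scale occlusion (pixel-masking) attribution sketch.

    For each patch scale, occlude one region at a time with `baseline`
    and record the drop in the model's scalar prediction; average the
    resulting maps across scales. `predict` maps an HxWxC array to a
    scalar (e.g., a predicted human realism rating).
    """
    h, w = image.shape[:2]
    base_score = predict(image)
    total = np.zeros((h, w))
    for ps in patch_sizes:
        attr = np.zeros((h, w))
        counts = np.zeros((h, w))
        for y in range(0, h, ps):
            for x in range(0, w, ps):
                masked = image.copy()
                masked[y:y + ps, x:x + ps] = baseline
                # Importance = how much the rating falls when this region is hidden.
                delta = base_score - predict(masked)
                attr[y:y + ps, x:x + ps] += delta
                counts[y:y + ps, x:x + ps] += 1
        total += attr / np.maximum(counts, 1)
    return total / len(patch_sizes)
```

In the paper's setup, `predict` would be a frozen pretrained backbone with a lightweight regression head (or an ensemble of such models, averaging their outputs), which is what makes image-level ensemble attribution possible.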
Problem

Research questions and friction points this paper is trying to address.

non-identifiability
attribution maps
deep neural networks
image authenticity
explanation robustness
Innovation

Methods, ideas, or system contributions that make the work stand out.

attribution robustness
model interpretability
ensemble explanation
image authenticity judgment
cross-architecture consistency
Icaro Re Depaolini
Center for Mind/Brain Sciences, The University of Trento, Trento, Italy
Uri Hasson
Professor of Psychology and Neuroscience
Social and cognitive neuroscience