Non-identifiability of Explanations from Model Behavior in Deep Networks of Image Authenticity Judgments

📅 2026-04-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study investigates whether attribution explanations from deep neural networks align with human cognitive mechanisms in judging image realism. The authors fit lightweight regression heads on multiple frozen pretrained vision models, including EfficientNetB3 and Barlow Twins, and generate attribution maps using Grad-CAM, LIME, and multi-scale pixel masking to systematically evaluate explanation consistency and robustness across architectures. Despite comparable predictive performance in estimating human realism ratings (about 80% of the noise ceiling), the models diverge substantially in their attributions; VGG, notably, relies predominantly on image-quality cues rather than semantic authenticity. To address this, the work proposes an ensemble strategy that improves prediction of human authenticity judgments and yields more reliable, interpretable image-level attributions via pixel masking.
📝 Abstract
Deep neural networks can predict human judgments, but this does not imply that they rely on human-like information or reveal the cues underlying those judgments. Prior work has addressed this issue using attribution heatmaps, but their explanatory value in itself depends on robustness. Here we tested the robustness of such explanations by evaluating whether models that predict human authenticity ratings also produce consistent explanations within and across architectures. We fit lightweight regression heads to multiple frozen pretrained vision models and generated attribution maps using Grad-CAM, LIME, and multiscale pixel masking. Several architectures predicted ratings well, reaching about 80% of the noise ceiling. VGG models achieved this by tracking image quality rather than authenticity-specific variance, limiting the relevance of their attributions. Among the remaining models, attribution maps were generally stable across random seeds within an architecture, especially for EfficientNetB3 and Barlow Twins, and consistency was higher for images judged as more authentic. Crucially, agreement in attribution across architectures was weak even when predictive performance was similar. To address this, we combined models in ensembles, which improved prediction of human authenticity judgments and enabled image-level attribution via pixel masking. We conclude that while deep networks can predict human authenticity judgments well, they do not produce identifiable explanations for those judgments. More broadly, our findings suggest that post hoc explanations from successful models of behavior should be treated as weak evidence for cognitive mechanism.
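The pixel-masking attribution described in the abstract can be illustrated with a minimal occlusion sketch: slide a baseline-valued patch over the image at several scales and record how much the model's predicted realism rating drops when each region is hidden. This is an illustrative reconstruction, not the authors' implementation; the `predict` function, patch sizes, and baseline value are all assumptions for the example.

```python
import numpy as np

def occlusion_attribution(predict, image, patch_sizes=(8, 16), baseline=0.0):
    """Multi-scale occlusion (pixel-masking) attribution sketch.

    For each patch scale, occlude one region at a time with `baseline`
    and record the drop in the model's scalar prediction; average the
    resulting maps across scales. `predict` maps an HxWxC array to a
    scalar (e.g., a predicted human realism rating).
    """
    h, w = image.shape[:2]
    base_score = predict(image)
    total = np.zeros((h, w))
    for ps in patch_sizes:
        attr = np.zeros((h, w))
        counts = np.zeros((h, w))
        for y in range(0, h, ps):
            for x in range(0, w, ps):
                masked = image.copy()
                masked[y:y + ps, x:x + ps] = baseline
                # Importance = how much the rating falls when this region is hidden.
                delta = base_score - predict(masked)
                attr[y:y + ps, x:x + ps] += delta
                counts[y:y + ps, x:x + ps] += 1
        total += attr / np.maximum(counts, 1)
    return total / len(patch_sizes)
```

In the paper's setup, `predict` would be a frozen pretrained backbone with a lightweight regression head (or an ensemble of such models, averaging their outputs), which is what makes image-level ensemble attribution possible.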
Problem

Research questions and friction points this paper is trying to address.

non-identifiability
attribution maps
deep neural networks
image authenticity
explanation robustness
Innovation

Methods, ideas, or system contributions that make the work stand out.

attribution robustness
model interpretability
ensemble explanation
image authenticity judgment
cross-architecture consistency
Icaro Re Depaolini
Center for Mind/Brain Sciences, The University of Trento, Trento, Italy
Uri Hasson
Professor of Psychology and Neuroscience
Social and cognitive neuroscience