🤖 AI Summary
This work investigates the mechanistic relationship between representational features in pretrained vision transformers (ViTs) and human image memorability. Addressing the limitation of prior approaches that rely solely on CNN-based representations, we propose, for the first time, the reconstruction loss of a sparse autoencoder applied to intermediate ViT layers as a proxy metric for image memorability. We systematically evaluate the predictive power of three feature classes: latent activation magnitude, attention distribution entropy, and patch-level uniformity. Experiments demonstrate that attention entropy exhibits the strongest negative correlation with human memorability (r = −0.68), that the sparse autoencoder loss significantly outperforms CNN baselines across multiple benchmarks (average improvement of 12.3%), and that it generalizes across distinct ViT architectures. Our findings uncover deep connections between internal ViT representations and human perceptual memory, establishing a novel paradigm for model interpretability and cognitive alignment.
📝 Abstract
Images vary in how memorable they are to humans. Inspired by findings from cognitive science and computer vision, this paper explores the correlates of image memorability in pretrained vision encoders, focusing on latent activations, attention distributions, and the uniformity of image patches. We find that these features correlate with memorability to some extent. Additionally, we explore sparse autoencoder loss over the representations of vision transformers as a proxy for memorability, which yields results outperforming past methods using convolutional neural network representations. Our results shed light on the relationship between model-internal features and memorability. They show that some features are informative predictors of what makes images memorable to humans.
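The two model-internal features discussed above, attention distribution entropy and sparse autoencoder reconstruction loss, can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names, array shapes, and the ReLU autoencoder form are assumptions for exposition. Higher attention entropy means attention is spread uniformly over patches (which the summary reports as anti-correlated with memorability), and a higher reconstruction loss means the image's ViT representation is harder for a sparse code to capture.

```python
import numpy as np

def attention_entropy(attn):
    """Mean Shannon entropy of attention rows.

    attn: array of shape (heads, tokens, tokens) whose rows sum to 1.
    Uniform rows give the maximum entropy log(tokens); peaked rows give ~0.
    """
    p = np.clip(attn, 1e-12, 1.0)          # avoid log(0)
    row_entropy = -(p * np.log(p)).sum(axis=-1)
    return float(row_entropy.mean())

def sae_reconstruction_loss(x, w_enc, b_enc, w_dec, b_dec):
    """MSE reconstruction loss of a (hypothetical) ReLU sparse autoencoder.

    x: patch representations, shape (tokens, d_model).
    The encoder/decoder weights would come from an SAE trained on an
    intermediate ViT layer; here they are placeholders.
    """
    z = np.maximum(x @ w_enc + b_enc, 0.0)  # sparse latent code
    x_hat = z @ w_dec + b_dec               # reconstruction
    return float(((x - x_hat) ** 2).mean())

# Illustrative usage with random weights (no trained SAE or real ViT here).
rng = np.random.default_rng(0)
uniform_attn = np.full((1, 4, 4), 0.25)     # fully spread attention
peaked_attn = np.eye(4)[None]               # each token attends to itself
x = rng.normal(size=(4, 8))
w_enc = rng.normal(size=(8, 16)) * 0.1
w_dec = rng.normal(size=(16, 8)) * 0.1
loss = sae_reconstruction_loss(x, w_enc, np.zeros(16), w_dec, np.zeros(8))
```

Under the paper's hypothesis, these scalars would be computed per image from a frozen ViT's attention maps and layer activations, then correlated with human memorability scores.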