🤖 AI Summary
Current vision foundation models lack reliable, unified standards for evaluating human interpretability, a gap that matters most in high-stakes scenarios. This work proposes a quantifiable and comparable assessment framework that combines two psychophysical experiments, measuring how well human observers can localize and name a model's features, with features extracted via sparse autoencoders and a chance-calibrated scoring mechanism based on random baselines, so that interpretability is measured on a unified scale. Empirical analysis across six vision transformers (two supervised ViTs and four foundation models) and over 15,000 human responses shows that the foundation models are consistently less interpretable than their supervised counterparts. Crucially, interpretability does not correlate with downstream task performance; it correlates instead with the locality of feature activations and with alignment to coarse-grained semantic concepts.
📝 Abstract
How interpretable are the features of leading vision models? The question is increasingly pressing as these models move from research benchmarks into high-stakes deployments, yet existing methods cannot answer it reliably. We close this gap with a framework for measuring and comparing the human interpretability of vision models, built around two complementary psychophysics protocols: (1) localizability -- can an observer predict where a feature fires on a novel image? -- and (2) nameability -- can an observer accurately describe what the feature represents? Features are recovered via sparse autoencoders, and a chance-anchored scoring function places every model on a common scale. Applying the framework to six vision transformers -- two supervised ViTs and four foundation models (DINOv2, DINOv3, CLIP, SigLIP) -- we collected more than $15{,}000$ behavioral responses, analyzing the $13{,}400$ responses from the $377$ participants who passed our pre-specified quality checks. Foundation models are consistently *less* interpretable than their supervised counterparts, and the gap is not a capability tradeoff: interpretability does not correlate with downstream task performance on any benchmark we examine. What does correlate is the locality of a feature's activations and coarse-grained semantic alignment with humans -- models with focal activations and representations that reflect the world's broad categorical structure produce more interpretable features, whereas fine-grained perceptual alignment does not. The two protocols yield strongly correlated rankings and share the same predictors, establishing interpretability as an independent, measurable dimension of representation quality -- and, surprisingly, one on which every foundation model we tested falls below the supervised baselines that came before. Capability alone cannot close that gap; locality and coarse-grained alignment can.
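The abstract does not spell out the chance-anchored scoring function, so the following is only a minimal sketch of one common way to anchor a score to chance, not the authors' formula. It assumes each protocol yields a raw accuracy in [0, 1] and has a known chance level; the function name `chance_anchored_score` and all numbers are hypothetical and chosen purely for illustration.

```python
def chance_anchored_score(accuracy: float, chance: float) -> float:
    """Rescale raw accuracy so that 0.0 means chance level and 1.0 means perfect.

    Assumption: this linear rescaling is a standard chance correction,
    not necessarily the scoring function used in the paper.
    """
    if not 0.0 <= chance < 1.0:
        raise ValueError("chance must lie in [0, 1)")
    if not 0.0 <= accuracy <= 1.0:
        raise ValueError("accuracy must lie in [0, 1]")
    return (accuracy - chance) / (1.0 - chance)


# Illustrative values only (not results from the paper): two tasks with
# different chance levels land on the same point of the common scale.
two_way = chance_anchored_score(accuracy=0.75, chance=0.5)     # 2 alternatives
four_way = chance_anchored_score(accuracy=0.625, chance=0.25)  # 4 alternatives
print(two_way, four_way)
```

Under this rescaling, a below-chance accuracy maps to a negative score, which keeps the scale comparable across tasks whose chance levels differ.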