🤖 AI Summary
This study investigates how well self-supervised visual models align with human perception in object grouping and segmentation. The authors construct a large-scale human behavioral benchmark, recording participants' judgments of whether pairs of points in natural scenes belong to the same object along with their reaction times, and use it to systematically evaluate how closely various model representations match human perceptual judgments. They introduce a novel metric to quantify object-centric structure in model representations and use Gram matrix distillation to improve alignment with the human data. Experiments show that Vision Transformers trained with DINO match human object-grouping behavior most closely, and that stronger object-centric structure and closer Gram matrix correspondence both improve the prediction of human reaction times.
📝 Abstract
Vision foundation models trained with self-supervised objectives achieve strong performance across diverse tasks and exhibit emergent object segmentation properties. However, their alignment with human object perception remains poorly understood. Here, we introduce a behavioral benchmark in which participants make same/different-object judgments for pairs of dots placed on naturalistic scenes, scaling up a classical psychophysics paradigm to over 1,000 trials. We test a diverse set of vision models, using a simple readout from their representations to predict subjects' reaction times. We observe a steady improvement across model generations, with both architecture and training objective contributing to alignment; transformer-based models trained with the DINO self-supervised objective show the strongest performance. To investigate the source of this improvement, we propose a novel metric that quantifies the object-centric component of a representation by measuring patch similarity within and between objects. Across models, stronger object-centric structure predicts human segmentation behavior more accurately. We further show that matching the Gram matrix of supervised transformer models, which captures the similarity structure across image patches, to that of a self-supervised model through distillation improves their alignment with human behavior, converging with the prior finding that Gram anchoring improves DINOv3's feature quality. Together, these results demonstrate that self-supervised vision models capture object structure in a behaviorally human-like manner, and that Gram matrix structure plays a role in driving perceptual alignment.
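The object-centric metric described in the abstract compares patch similarity within objects to patch similarity between objects. The sketch below is a minimal illustration of that idea, assuming cosine similarity over L2-normalized patch embeddings and a per-patch object labeling; the paper's exact formulation may differ, and `object_centric_score` is an illustrative name, not the authors' code:

```python
import numpy as np

def object_centric_score(patches, labels):
    """Toy object-centric metric: mean cosine similarity of patch pairs
    within the same object minus that of pairs from different objects.
    (Illustrative sketch; not the paper's exact formula.)

    patches : (N, D) array of patch embeddings
    labels  : (N,) array of object IDs, one per patch
    """
    # L2-normalize so dot products are cosine similarities
    p = patches / np.linalg.norm(patches, axis=1, keepdims=True)
    sim = p @ p.T
    same_object = labels[:, None] == labels[None, :]
    off_diagonal = ~np.eye(len(labels), dtype=bool)
    within = sim[same_object & off_diagonal].mean()
    between = sim[~same_object].mean()
    return within - between
```

A representation with strong object-centric structure (patches of one object tightly clustered, distinct objects far apart) yields a high score; shuffling the labels across objects drives it toward zero or below.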
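The Gram-matching idea can likewise be sketched as a distillation loss on patch-patch similarity structure. The snippet below is a toy illustration under assumed choices, not the paper's implementation: `gram_distillation_loss`, the row normalization, and the mean-squared (Frobenius-style) mismatch are all assumptions:

```python
import numpy as np

def gram(feats):
    # Row-normalize patch features so the Gram matrix holds cosine similarities
    f = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    return f @ f.T

def gram_distillation_loss(student_feats, teacher_feats):
    # Mean squared difference between the two (N, N) patch-similarity matrices;
    # minimizing it pushes the student's similarity structure toward the
    # teacher's without forcing the features themselves to match
    diff = gram(student_feats) - gram(teacher_feats)
    return float(np.mean(diff ** 2))
```

Because the loss is defined on the Gram matrix rather than on raw features, the student only has to reproduce the teacher's relational structure across patches, which is the property the paper ties to human-aligned object grouping.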