🤖 AI Summary
This study investigates whether Gram matrix–based texture representations in convolutional neural networks (CNNs) align with human texture perception, and whether that alignment improves as CNNs become better models of the visual system. Combining human psychophysical data with Brain-Score neural benchmarking, the authors compare a diverse set of CNN architectures. They find that, regardless of a model's object recognition performance or neural predictivity, its texture representations fail to capture key aspects of human perceptual judgments. The study finds no significant association between a CNN's quality as a general model of the visual system and its fidelity to human texture perception, which challenges the assumption that training on object recognition inherently yields human-like texture representations. The results suggest that human texture perception may rely on contextual integration mechanisms that go beyond the local feature correlations encoded by standard CNNs.
📝 Abstract
Mathematical modeling of visual textures traces back to Julesz's intuition that texture perception in humans is based on local correlations between image features. An influential approach to texture analysis and generation generalizes this notion to linear correlations between the nonlinear features computed by convolutional neural networks (CNNs), compiled into Gram matrices. Given that CNNs are often used as models of the visual system, it is natural to ask whether such "texture representations" spontaneously align with the textures' perceptual content, and in particular whether those CNNs that are regarded as better models of the visual system also possess more human-like texture representations. Here we assess the perceptual content captured by feature correlations computed for a diverse pool of CNNs, and we compare it to the models' alignment with the mammalian visual system as measured by Brain-Score. Surprisingly, we find no connection between conventional measures of a CNN's quality as a model of the visual system and its alignment with human texture perception. We conclude that texture perception involves mechanisms distinct from those commonly modeled by CNNs trained on object recognition, possibly depending on the integration of contextual information.
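To make the Gram-matrix idea concrete, the following is a minimal sketch, not the authors' implementation. It assumes the activations of one CNN layer are available as an array of shape `(C, H, W)`; the function name `gram_matrix`, the random stand-in activations, and the normalization by the number of spatial positions are all illustrative assumptions.

```python
import numpy as np

def gram_matrix(features: np.ndarray) -> np.ndarray:
    """Channel-by-channel feature correlations for one layer.

    features: hypothetical activations of shape (C, H, W).
    Returns a (C, C) matrix averaged over spatial positions
    (the normalization is an assumption; conventions vary).
    """
    c, h, w = features.shape
    flat = features.reshape(c, h * w)   # one row per channel
    return flat @ flat.T / (h * w)

# Random stand-in for real CNN activations (illustration only).
rng = np.random.default_rng(0)
feats = rng.standard_normal((8, 16, 16))
G = gram_matrix(feats)

# Shuffle spatial positions identically across all channels.
perm = rng.permutation(16 * 16)
shuffled = feats.reshape(8, -1)[:, perm].reshape(8, 16, 16)
G_shuffled = gram_matrix(shuffled)
```

Because the sum runs over all spatial positions, `G` and `G_shuffled` agree up to floating-point error: within a single layer, the Gram matrix records which features co-occur and discards where they occur.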