🤖 AI Summary
Vision models over-rely on local texture while neglecting global configural shape, i.e., the spatial arrangement of object parts. Existing evaluations treat shape and texture as opposing attributes and measure only relative bias, failing to quantify absolute configural capability.
Method: We propose the Configural Shape Score (CSS), the first absolute metric for configural shape understanding, based on Object-Anagram image pairs. CSS measures a model's ability to recognize object categories under part rearrangement while preserving local texture. We employ radius-controlled attention masking, representational similarity analysis, and BagNet ablation studies, complemented by mechanistic investigations of self-supervised and language-aligned Transformers (e.g., DINOv2, SigLIP2, EVA-CLIP).
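A minimal sketch of how CSS might be computed, assuming it is the fraction of Object-Anagram pairs for which a classifier recognizes *both* members; the exact scoring rule and the `model`/`anagram_pairs` interfaces are illustrative assumptions, not the paper's implementation:

```python
import torch

@torch.no_grad()
def configural_shape_score(model, anagram_pairs, device="cpu"):
    """Hypothetical CSS: fraction of Object-Anagram pairs whose *two*
    arrangements are both classified correctly. `anagram_pairs` is assumed
    to yield (image_a, label_a, image_b, label_b), where the two images
    share local texture but permute global part arrangement."""
    model.eval()
    both_correct, total = 0, 0
    for img_a, label_a, img_b, label_b in anagram_pairs:
        pred_a = model(img_a.unsqueeze(0).to(device)).argmax(dim=-1).item()
        pred_b = model(img_b.unsqueeze(0).to(device)).argmax(dim=-1).item()
        both_correct += int(pred_a == label_a and pred_b == label_b)
        total += 1
    return both_correct / max(total, 1)
```

Requiring both members to be recognized is what makes the score absolute: a purely texture-driven model cannot pass, since both images carry the same local statistics but different category labels.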
Results: High-CSS models exhibit long-range interactions, a U-shaped information integration profile, and a mid-layer transition from local to global encoding. Crucially, CSS robustly predicts performance on diverse shape-sensitive downstream tasks.
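To illustrate the radius-controlled attention masking probe, the sketch below builds an additive mask that blocks attention between ViT patch tokens farther apart than `radius` on the patch grid. The distance metric, CLS-token handling, and function name are assumptions; the paper's exact protocol may differ:

```python
import torch

def radius_attention_mask(grid_h, grid_w, radius, cls_token=True):
    """Additive attention mask (0 = allowed, -inf = blocked) that forbids
    attention between patch tokens whose Euclidean grid distance
    exceeds `radius`. Assumes ViT-style tokenization."""
    ys, xs = torch.meshgrid(
        torch.arange(grid_h), torch.arange(grid_w), indexing="ij"
    )
    coords = torch.stack([ys.flatten(), xs.flatten()], dim=-1).float()  # (N, 2)
    dist = torch.cdist(coords, coords)  # (N, N) pairwise distances
    mask = torch.where(
        dist <= radius,
        torch.zeros_like(dist),
        torch.full_like(dist, float("-inf")),
    )
    if cls_token:
        n = mask.shape[0]
        full = torch.zeros(n + 1, n + 1)
        full[1:, 1:] = mask  # CLS row/column stay unmasked
        mask = full
    return mask  # add to attention logits before the softmax
```

Sweeping `radius` from small to large and re-evaluating CSS at each setting is one way to trace the kind of integration profile described above.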
📝 Abstract
Humans recognize objects based on both local texture cues and the configuration of object parts, yet contemporary vision models primarily harvest local texture cues, yielding brittle, non-compositional features. Work on shape-vs-texture bias has pitted shape and texture representations in opposition, measuring shape relative to texture, ignoring the possibility that models (and humans) can rely on both types of cue simultaneously, and obscuring the absolute quality of both types of representation. We therefore recast shape evaluation as a matter of absolute configural competence, operationalized by the Configural Shape Score (CSS), which (i) measures the ability to recognize both images in Object-Anagram pairs that preserve local texture while permuting the global arrangement of parts to depict different object categories. Across 86 convolutional, transformer, and hybrid models, CSS (ii) uncovers a broad spectrum of configural sensitivity, with fully self-supervised and language-aligned transformers -- exemplified by DINOv2, SigLIP2, and EVA-CLIP -- occupying the top end of the CSS spectrum. Mechanistic probes reveal that (iii) high-CSS networks depend on long-range interactions: radius-controlled attention masks abolish performance, showing a distinctive U-shaped integration profile, and representational-similarity analyses expose a mid-depth transition from local to global coding. (iv) A BagNet control remains at chance, ruling out "border-hacking" strategies. Finally, (v) we show that CSS also predicts performance on other shape-dependent evaluations. Overall, we propose that the path toward truly robust, generalizable, and human-like vision systems may lie not in forcing an artificial choice between shape and texture, but in architectural and learning frameworks that seamlessly integrate both local texture and global configural shape.
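For concreteness, a hedged sketch of the layer-wise representational-similarity analysis mentioned above: at each depth, correlate the representational dissimilarity matrices (RDMs) of a stimulus set under two conditions. All names, the correlation-distance RDM, and the Spearman comparison are assumptions about one common RSA recipe, not the paper's exact pipeline:

```python
import numpy as np
from scipy.stats import spearmanr

def layerwise_rsa(features_a, features_b):
    """Compare representational geometry across depth. Inputs are assumed
    to be dicts mapping layer name -> (n_stimuli, n_features) arrays of
    activations for the same stimuli under two conditions."""
    scores = {}
    for layer in features_a:
        rdm_a = 1 - np.corrcoef(features_a[layer])  # correlation-distance RDM
        rdm_b = 1 - np.corrcoef(features_b[layer])
        iu = np.triu_indices_from(rdm_a, k=1)       # upper triangle, no diagonal
        rho, _ = spearmanr(rdm_a[iu], rdm_b[iu])
        scores[layer] = rho                          # RDM agreement at this depth
    return scores
```

Plotted against depth, a drop or rise in these per-layer agreements is the sort of signature that would reveal the mid-depth transition from local to global coding described in the abstract.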