🤖 AI Summary
Current deep visual models rely too heavily on non-shape cues such as texture and background for object recognition, which leads to poor shape invariance under changes in 3D viewpoint and appearance. To address this limitation, this work proposes a fine-grained evaluation paradigm for embedding spaces, centered on 3D shape similarity. The authors construct a benchmark comprising 68,200 multi-view grayscale renderings and apply several complementary analyses, including nearest-neighbor matching, viewpoint tuning curves, and ordered matching grids, to systematically assess whether models cluster object views by shape. Experiments across 321 pretrained models reveal a widespread deficiency in shape understanding, with most models failing to achieve robust cross-viewpoint shape recognition.
📝 Abstract
Object recognition (OR) in humans relies heavily on shape cues and the ability to recognize objects across varying 3D viewpoints. Unlike humans, deep networks often rely on non-shape cues such as texture and background, leading to vulnerabilities in generalization and robustness. To address this gap, we introduce ShapeY, a novel and principled benchmarking framework designed to evaluate shape-based recognition capability in OR systems. ShapeY comprises 68,200 grayscale images of 200 3D objects rendered from multiple viewpoints and optionally subjected to non-shape "appearance" changes. Using a nearest-neighbor matching task, ShapeY probes the fine-grained structure of an OR system's embedding space by evaluating whether object views are clustered by 3D shape similarity across varying 3D viewpoints and other non-shape changes. ShapeY provides a suite of quantitative and qualitative performance readouts, including error rate graphs, viewpoint tuning curves, histograms of positive and negative matching scores, and grids showing ordered best matches, which together offer a comprehensive evaluation of an OR system's shape understanding capability. Testing of 321 pre-trained networks with diverse architectures reveals significant challenges in achieving robust shape-based recognition: even state-of-the-art models struggle to generalize consistently across 3D viewpoint and appearance changes, and are prone to infrequent but egregious matches between objects of clearly different shape. ShapeY establishes a principled framework for advancing artificial vision systems toward human-like shape recognition capabilities, emphasizing the importance of disentangled and invariant object encodings.
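To make the nearest-neighbor matching task more concrete, the following is a minimal sketch of the general idea, not the ShapeY implementation. It assumes each view has already been mapped to an embedding vector, that similarity is measured by cosine similarity (an assumption; the abstract does not name a metric), and that a match counts as an error when a view's best match belongs to a different object. The function names `cosine_similarity` and `nearest_neighbor_error_rate`, the toy embeddings, and the object labels are all hypothetical. Any benchmark-specific rules about which candidate views are eligible for matching are omitted.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest_neighbor_error_rate(
    embeddings: List[Sequence[float]], object_ids: List[str]
) -> float:
    """Fraction of views whose most similar *other* view shows a different object.

    Assumption: every view is compared against every other view; a real
    benchmark may restrict the candidate set.
    """
    errors = 0
    for i, query in enumerate(embeddings):
        best_j, best_sim = -1, -math.inf
        for j, candidate in enumerate(embeddings):
            if j == i:
                continue  # a view may not match itself
            sim = cosine_similarity(query, candidate)
            if sim > best_sim:
                best_j, best_sim = j, sim
        if object_ids[best_j] != object_ids[i]:
            errors += 1
    return errors / len(embeddings)


# Toy 2-D embeddings (hypothetical): two "mug" views and two "chair" views
# that cluster cleanly, plus one "mug" view that lands near the chairs.
embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9], [0.2, 0.8]]
object_ids = ["mug", "mug", "chair", "chair", "mug"]
print(nearest_neighbor_error_rate(embeddings, object_ids))
```

In this toy example only the last view is matched to the wrong object, so one of the five views counts as an error. An embedding space that clusters views by 3D shape would keep this rate low even as viewpoint and appearance vary.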