🤖 AI Summary
This work addresses the limited abstraction invariance of Large Vision-Language Models (LVLMs) in shape and texture/material recognition. To this end, we introduce LAS&T, a large-scale joint 2D/3D benchmark covering multi-view, multi-texture, and multi-environment scenarios. Methodologically, we propose a synthetic data generation technique that automatically extracts shape and texture priors from real-world images, yielding a visual-invariance evaluation framework grounded in matching accuracy and augmented with controlled multi-perturbation experiments. Key findings show that LVLMs exhibit significantly lower shape-matching accuracy than humans, dropping sharply under viewpoint or environmental changes, while achieving human-level performance (>92%) in 3D material recognition. In contrast, their 2D texture recognition lags behind humans by an average of 27 percentage points, exposing a "strong 3D-material, weak 2D-texture" cognitive bias. The dataset and evaluation toolkit are publicly released.
📝 Abstract
Shape and texture recognition is fundamental to visual perception. The ability to identify shapes regardless of orientation, texture, or context, and to recognize textures independently of their associated objects, is essential for general visual understanding of the world. We introduce the Large Shape&Textures dataset (LAS&T), a giant collection of diverse shapes and textures automatically extracted from real-world images. This dataset is used to evaluate how effectively leading Large Vision-Language Models (LVLMs) understand shapes, textures, and materials in both 2D and 3D scenes. For shape recognition, we test models' ability to match identical shapes that differ in orientation, texture, color, or environment. Our results show that LVLMs' shape identification capabilities remain significantly below human performance. Single alterations (orientation, texture) cause minor decreases in matching accuracy, while multiple simultaneous changes precipitate dramatic drops. LVLMs appear to rely predominantly on high-level semantic features and struggle with abstract shapes lacking clear class associations. For texture and material recognition, we evaluate models' ability to identify identical textures and materials across different objects and environments. Interestingly, leading LVLMs approach human-level performance in recognizing materials in 3D scenes, yet substantially underperform humans when identifying simpler 2D textures. The LAS&T dataset and benchmark, the largest and most diverse resource for shape and texture evaluation, are freely available with generation and testing scripts.
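The matching protocol described above reduces to a multiple-choice accuracy computation: for each trial, the model sees a query shape and several candidates, exactly one of which is the same shape under an altered orientation, texture, or environment, and accuracy is the fraction of trials where the model picks the true match. A minimal sketch of that metric, using illustrative trial records rather than the LAS&T toolkit's actual API:

```python
# Matching accuracy for a multiple-choice shape-matching task.
# Each trial records the true matching candidate and the model's pick.
# The trial data below is hypothetical, not drawn from LAS&T itself.

def matching_accuracy(trials):
    """Return the fraction of trials where the model picked the true match."""
    correct = sum(1 for t in trials if t["predicted"] == t["target"])
    return correct / len(trials)

trials = [
    {"target": "shape_A", "predicted": "shape_A"},  # same shape, new orientation
    {"target": "shape_B", "predicted": "shape_C"},  # fooled by a texture change
    {"target": "shape_D", "predicted": "shape_D"},  # new background, still matched
    {"target": "shape_E", "predicted": "shape_E"},  # color change, still matched
]

print(matching_accuracy(trials))  # → 0.75
```

Per-condition accuracies (e.g. orientation-only vs. orientation-plus-texture changes) can then be compared to quantify how much each perturbation degrades matching.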