🤖 AI Summary
This study investigates whether deep learning models can learn human-like, hierarchical 3D shape representations from sparse point clouds, particularly examining whether global shape recognition is fundamentally constrained by sparsity. Method: Two human psychophysical experiments systematically manipulated point density, object orientation, and local geometric structure; human performance was then compared with that of a point transformer and a dynamic graph convolutional network (DGCNN) to test whether hierarchical abstraction, the progression from local geometry to global structure, underlies human-like recognition. Results: The point transformer exhibited human-level consistency in shape recognition under challenging conditions, including low point density and viewpoint variation, outperforming the convolution-based DGCNN; its representations also aligned more closely with human 3D shape cognition than those of the convolutional architecture. This work provides the first systematic evidence that hierarchical abstraction is critical for human-like 3D recognition, establishing a novel, interpretable paradigm for 3D vision modeling.
📝 Abstract
Both humans and deep learning models can recognize objects from 3D shapes depicted with sparse visual information, such as a set of points randomly sampled from the surfaces of a 3D object (termed a point cloud). Although deep learning models achieve human-like performance in recognizing objects from 3D shapes, it remains unclear whether these models develop 3D shape representations similar to those used by human vision for object recognition. We hypothesize that training with 3D shapes enables models to form representations of local geometric structure, but that their representations of global 3D object shape may be limited. We conducted two human experiments that systematically manipulated point density and object orientation (Experiment 1) and local geometric structure (Experiment 2). Humans consistently performed well across all experimental conditions. We then compared human performance with that of two deep learning models: one based on a convolutional neural network (DGCNN) and the other on a visual transformer (point transformer). The point transformer model accounted for human performance better than the convolution-based model, an advantage that mainly stems from the point transformer's mechanism for hierarchical abstraction of 3D shapes.
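As a concrete illustration of the stimulus format, the sketch below shows one common way to generate such point clouds: area-weighted uniform sampling of points from the triangles of a 3D mesh. This is a generic technique offered for clarity, not the paper's actual stimulus-generation code; the function and parameter names are assumptions.

```python
import numpy as np

def sample_point_cloud(vertices, faces, n_points, rng=None):
    """Uniformly sample points from a triangle mesh surface.

    Hypothetical sketch of how sparse point-cloud stimuli like those
    described above could be produced; names are illustrative only.
    """
    rng = np.random.default_rng(rng)
    tris = vertices[faces]  # (F, 3, 3): corner coordinates per face
    # Triangle areas (half the cross-product norm) serve as sampling
    # weights so points are uniform over the surface, not per-triangle.
    cross = np.cross(tris[:, 1] - tris[:, 0], tris[:, 2] - tris[:, 0])
    areas = 0.5 * np.linalg.norm(cross, axis=1)
    idx = rng.choice(len(faces), size=n_points, p=areas / areas.sum())
    # Uniform barycentric coordinates inside each chosen triangle.
    u, v = rng.random(n_points), rng.random(n_points)
    flip = u + v > 1.0
    u[flip], v[flip] = 1.0 - u[flip], 1.0 - v[flip]
    t = tris[idx]
    return (t[:, 0]
            + u[:, None] * (t[:, 1] - t[:, 0])
            + v[:, None] * (t[:, 2] - t[:, 0]))

# Example: a sparse 128-point cloud from a unit cube (12 triangles).
verts = np.array([[x, y, z] for x in (0, 1) for y in (0, 1) for z in (0, 1)],
                 dtype=float)
faces = np.array([
    [0, 1, 3], [0, 3, 2],   # x = 0 face
    [4, 7, 5], [4, 6, 7],   # x = 1 face
    [0, 5, 1], [0, 4, 5],   # y = 0 face
    [2, 3, 7], [2, 7, 6],   # y = 1 face
    [0, 2, 6], [0, 6, 4],   # z = 0 face
    [1, 5, 7], [1, 7, 3],   # z = 1 face
])
pts = sample_point_cloud(verts, faces, 128, rng=0)  # shape (128, 3)
```

Varying `n_points` in a sampler like this is what "manipulating point density" amounts to in practice: the same underlying surface is rendered with more or fewer sample points.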