🤖 AI Summary
How can human-level 3D shape perception be achieved from 2D visual inputs? This work proposes a self-supervised neural network trained only on multi-view natural images and associated visual-spatial signals, such as camera pose and depth, without object-specific inductive biases or task-specific training. Evaluated zero-shot on a well-established 3D shape inference task, the model matches human accuracy, reproduces characteristic human error patterns and reaction times, and exhibits internal representational dynamics that align closely with human perceptual behavior. Notably, this is the first modeling framework shown to match human accuracy on 3D shape inferences without task-specific training or fine-tuning, demonstrating systematic alignment with human behavioral data under zero-shot conditions.
📝 Abstract
Humans can infer the three-dimensional structure of objects from two-dimensional visual inputs. Modeling this ability has been a longstanding goal for the science and engineering of visual intelligence, yet decades of computational methods have fallen short of human performance. Here we develop a modeling framework that predicts human 3D shape inferences for arbitrary objects, directly from experimental stimuli. We achieve this with a novel class of neural networks trained using a visual-spatial objective over naturalistic sensory data; given a set of images taken from different locations within a natural scene, these models learn to predict spatial information related to these images, such as camera location and visual depth, without relying on any object-related inductive biases. Notably, these visual-spatial signals are analogous to sensory cues readily available to humans. We design a zero-shot evaluation approach to determine the performance of these "multi-view" models on a well-established 3D perception task, and then compare model and human behavior. Our modeling framework is the first to match human accuracy on 3D shape inferences, even without task-specific training or fine-tuning. Remarkably, independent readouts of model responses predict fine-grained measures of human behavior, including error patterns and reaction times, revealing a natural correspondence between model dynamics and human perception. Taken together, our findings indicate that human-level 3D perception can emerge from a simple, scalable learning objective over naturalistic visual-spatial data. All code, human behavioral data, and experimental stimuli needed to reproduce our findings can be found on our project page.
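To make the training objective concrete, here is a minimal PyTorch sketch of a visual-spatial objective of the kind the abstract describes: a shared encoder processes two views of the same scene and is supervised only on spatial signals (per-view depth and the relative camera pose), never on object labels or 3D shape annotations. This is an illustrative sketch, not the paper's implementation; the names `MultiViewNet`, `visual_spatial_loss`, `depth_head`, and `pose_head`, as well as the backbone and loss weighting, are assumptions.

```python
# Illustrative sketch of a visual-spatial training objective (not the paper's code):
# from two views of the same scene, a shared encoder predicts per-view depth and
# the relative camera pose, and is supervised only with these spatial signals.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MultiViewNet(nn.Module):
    def __init__(self, feat_dim=256):
        super().__init__()
        # Shared image encoder (a stand-in for whatever backbone the authors use).
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, feat_dim, 3, stride=2, padding=1), nn.ReLU(),
        )
        # Per-pixel head predicting a coarse depth map for each view.
        self.depth_head = nn.Conv2d(feat_dim, 1, 1)
        # Head predicting the relative camera pose between the two views
        # (3 translation + 3 rotation parameters) from pooled features.
        self.pose_head = nn.Linear(2 * feat_dim, 6)

    def forward(self, view_a, view_b):
        fa, fb = self.encoder(view_a), self.encoder(view_b)
        depth_a = F.softplus(self.depth_head(fa))  # keep predicted depth positive
        depth_b = F.softplus(self.depth_head(fb))
        pooled = torch.cat([fa.mean(dim=(2, 3)), fb.mean(dim=(2, 3))], dim=1)
        rel_pose = self.pose_head(pooled)
        return depth_a, depth_b, rel_pose


def visual_spatial_loss(model, view_a, view_b, depth_gt_a, depth_gt_b, pose_gt):
    """Supervise only with spatial signals (depth maps and camera pose);
    no object identity or 3D shape labels are involved."""
    depth_a, depth_b, rel_pose = model(view_a, view_b)
    # Match ground-truth depth resolution to the coarse predictions.
    depth_gt_a = F.interpolate(depth_gt_a, size=depth_a.shape[-2:])
    depth_gt_b = F.interpolate(depth_gt_b, size=depth_b.shape[-2:])
    depth_loss = F.l1_loss(depth_a, depth_gt_a) + F.l1_loss(depth_b, depth_gt_b)
    pose_loss = F.mse_loss(rel_pose, pose_gt)
    return depth_loss + pose_loss
```

Under this sketch, a training loop would sample pairs of views with their recorded camera poses and depth maps from naturalistic scenes and minimize the combined loss; the zero-shot evaluation described in the abstract would then read out the frozen model's responses to the experimental stimuli, with no further task-specific training.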