🤖 AI Summary
Whether state-of-the-art vision and multimodal foundation models match the robustness of human vision when recognizing objects in unusual poses, and whether they rely on the same underlying mechanisms, remains an open question.
Method: We conducted time-limited behavioral experiments with human participants, cross-model evaluation across vision models (EfficientNet, ViT, Swin) and multimodal LLMs (Claude 3.5, GPT-4, Gemini 1.5), and fine-grained error pattern analysis.
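A standard way to quantify the kind of error-pattern (dis)similarity measured here is error consistency: Cohen's kappa computed on binary per-trial correctness vectors, which asks whether two observers err on the *same* images beyond what their accuracies alone would predict. A minimal sketch (the function name and the use of kappa for this comparison are illustrative assumptions, not the paper's exact analysis code):

```python
import numpy as np

def error_consistency(correct_a, correct_b):
    """Cohen's kappa on binary per-trial correctness vectors.

    Values near 1 mean the two observers err on the same trials;
    values near 0 mean their errors overlap only as much as expected
    by chance given their individual accuracies.
    """
    a = np.asarray(correct_a, dtype=bool)
    b = np.asarray(correct_b, dtype=bool)
    po = np.mean(a == b)                 # observed trial-level agreement
    pa, pb = a.mean(), b.mean()          # marginal accuracies
    pe = pa * pb + (1 - pa) * (1 - pb)   # agreement expected by chance
    if pe == 1.0:                        # degenerate case: both all-correct or all-wrong
        return 1.0 if po == 1.0 else 0.0
    return (po - pe) / (1 - pe)
```

For example, two observers that are correct on exactly the same trials get kappa = 1, while observers whose errors fall on unrelated trials score near 0 even if both have the same overall accuracy.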
Contribution/Results: Humans significantly outperform all tested models except Gemini 1.5, which approaches human-level accuracy on unusual poses. Crucially, when image exposure time is shortened, human performance degrades to the level of deep networks, suggesting that recognizing objects in unusual poses depends on slow, time-consuming high-level processing rather than rapid feedforward computation alone. An analysis of error patterns further shows that even time-limited humans make errors dissimilar to those of feedforward deep networks. Together, these results indicate that humans and deep networks rely on different mechanisms for recognizing objects in unusual poses, and that understanding the mental processes unfolding during the extra viewing time may be key to reproducing human robustness in artificial vision systems.
📝 Abstract
Deep learning is closing the gap with human vision on several object recognition benchmarks. Here we investigate this gap for challenging images where objects are seen in unusual poses. We find that humans excel at recognizing objects in such poses. In contrast, state-of-the-art deep networks for vision (EfficientNet, SWAG, ViT, SWIN, BEiT, ConvNext) and state-of-the-art large vision-language models (Claude 3.5, Gemini 1.5, GPT-4) are systematically brittle on unusual poses, with the exception of Gemini, which shows excellent robustness in that condition. As we limit image exposure time, human performance degrades to the level of deep networks, suggesting that additional mental processes (requiring additional time) are necessary to identify objects in unusual poses. An analysis of the error patterns of humans vs. networks reveals that even time-limited humans are dissimilar to feedforward deep networks. In conclusion, our comparison reveals that humans and deep networks rely on different mechanisms for recognizing objects in unusual poses. Understanding the nature of the mental processes taking place during the extra viewing time may be key to reproducing the robustness of human vision in silico.