Do Vision Transformers See Like Humans? Evaluating their Perceptual Alignment

📅 2025-08-13
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study systematically investigates how well Vision Transformers (ViTs) align with human visual perception in image recognition, examining how model scale, training-data volume, data augmentation, and regularization strategies affect that alignment. Method: Using the TID2013 dataset, we compare ViT models across scales and training rounds, establishing a reproducible evaluation framework for perceptual alignment. Contribution/Results: Stronger data augmentation and regularization significantly degrade alignment with human judgments; increasing model size or repeatedly training on the same images also reduces alignment, whereas dataset diversity has only a marginal influence. The work uncovers an inherent trade-off between model complexity, training strategy, and human-aligned perception, providing theoretical grounding and practical guidance for applications where perceptual fidelity is critical, such as embodied intelligence and human–computer interaction.

📝 Abstract
Vision Transformers (ViTs) achieve remarkable performance in image recognition tasks, yet their alignment with human perception remains largely unexplored. This study systematically analyzes how model size, dataset size, data augmentation and regularization impact ViT perceptual alignment with human judgments on the TID2013 dataset. Our findings confirm that larger models exhibit lower perceptual alignment, consistent with previous works. Increasing dataset diversity has a minimal impact, but exposing models to the same images more times reduces alignment. Stronger data augmentation and regularization further decrease alignment, especially in models exposed to repeated training cycles. These results highlight a trade-off between model complexity, training strategies, and alignment with human perception, raising important considerations for applications requiring human-like visual understanding.
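
As a rough illustration of the kind of evaluation the abstract describes, the sketch below scores perceptual alignment as the Spearman rank correlation between a ViT's feature-space distances on TID2013 reference/distorted pairs and the human mean opinion scores (MOS). This is a minimal sketch, not the authors' released code: the `timm` model name, the Euclidean feature distance, and the TID2013 loader interface are all assumptions.

```python
# Minimal sketch of a perceptual-alignment score on TID2013.
# Assumptions: timm ViT backbone, Euclidean distance between pooled
# embeddings, and a user-supplied loader yielding (ref, dist, mos) triples.
import torch
import timm
from scipy.stats import spearmanr

# num_classes=0 makes timm return pooled features instead of logits.
model = timm.create_model("vit_base_patch16_224", pretrained=True, num_classes=0)
model.eval()

def perceptual_distance(ref, dist):
    # ref, dist: preprocessed image tensors of shape (1, 3, 224, 224)
    with torch.no_grad():
        f_ref, f_dist = model(ref), model(dist)
    return torch.norm(f_ref - f_dist, p=2).item()

def alignment_score(pairs):
    # pairs: iterable of (reference_tensor, distorted_tensor, human_mos)
    dists = [perceptual_distance(r, d) for r, d, _ in pairs]
    mos = [m for _, _, m in pairs]
    rho, _ = spearmanr(dists, mos)
    # MOS rises with quality while feature distance falls, so negate
    # the correlation: higher score = closer agreement with humans.
    return -rho
```

Under this convention, a higher score means the model's distances track human quality judgments more closely; the paper's findings would show up as this score dropping for larger models or heavier augmentation.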
Problem

Research questions and friction points this paper is trying to address.

Evaluating how closely Vision Transformers align with human perception
Analyzing the impact of model size and data diversity on perceptual alignment
Exploring the trade-off between model complexity and human-like visual understanding
Innovation

Methods, ideas, or system contributions that make the work stand out.

A reproducible framework for analyzing ViT perceptual alignment with human judgments
Evidence that larger models exhibit lower perceptual alignment
Evidence that stronger data augmentation and regularization decrease alignment further