Do Vision Transformers See Like Humans? Evaluating their Perceptual Alignment

📅 2025-08-13
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study systematically investigates how well Vision Transformers (ViTs) align with human visual perception in image recognition, examining how model scale, training-data volume, data augmentation, and regularization strategies affect that alignment. Method: Using the TID2013 dataset, we compare ViT models across scales and training rounds, establishing a reproducible evaluation framework for perceptual alignment. Contribution/Results: Stronger data augmentation and regularization significantly degrade alignment with human judgments; increasing model size or repeatedly training on the same images also reduces alignment, whereas dataset diversity has only a marginal influence. The work uncovers an inherent trade-off between model complexity, training strategy, and human-aligned perception, providing theoretical grounding and practical guidance for applications where perceptual fidelity is critical, such as embodied intelligence and human–computer interaction.

📝 Abstract
Vision Transformers (ViTs) achieve remarkable performance in image recognition tasks, yet their alignment with human perception remains largely unexplored. This study systematically analyzes how model size, dataset size, data augmentation and regularization impact ViT perceptual alignment with human judgments on the TID2013 dataset. Our findings confirm that larger models exhibit lower perceptual alignment, consistent with previous works. Increasing dataset diversity has a minimal impact, but exposing models to the same images more times reduces alignment. Stronger data augmentation and regularization further decrease alignment, especially in models exposed to repeated training cycles. These results highlight a trade-off between model complexity, training strategies, and alignment with human perception, raising important considerations for applications requiring human-like visual understanding.
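
As a rough illustration of the kind of evaluation the abstract describes, the sketch below scores perceptual alignment as the Spearman rank correlation between a ViT's feature-space distances on TID2013 reference/distorted pairs and the human mean opinion scores (MOS). This is a minimal sketch, not the authors' released code: the `timm` model name, the Euclidean feature distance, and the TID2013 loader interface are all assumptions.

```python
# Minimal sketch of a perceptual-alignment score on TID2013.
# Assumptions: timm ViT backbone, Euclidean distance between pooled
# embeddings, and a user-supplied loader yielding (ref, dist, mos) triples.
import torch
import timm
from scipy.stats import spearmanr

# num_classes=0 makes timm return pooled features instead of logits.
model = timm.create_model("vit_base_patch16_224", pretrained=True, num_classes=0)
model.eval()

def perceptual_distance(ref, dist):
    # ref, dist: preprocessed image tensors of shape (1, 3, 224, 224)
    with torch.no_grad():
        f_ref, f_dist = model(ref), model(dist)
    return torch.norm(f_ref - f_dist, p=2).item()

def alignment_score(pairs):
    # pairs: iterable of (reference_tensor, distorted_tensor, human_mos)
    dists = [perceptual_distance(r, d) for r, d, _ in pairs]
    mos = [m for _, _, m in pairs]
    rho, _ = spearmanr(dists, mos)
    # MOS rises with quality while feature distance falls, so negate
    # the correlation: higher score = closer agreement with humans.
    return -rho
```

Under this convention, a higher score means the model's distances track human quality judgments more closely; the paper's findings would show up as this score dropping for larger models or heavier augmentation.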
Problem

Research questions and friction points this paper is trying to address.

Evaluating how closely Vision Transformers align with human perception
Analyzing the impact of model size and data diversity on perceptual alignment
Exploring the trade-off between model complexity and human-like visual understanding
Innovation

Methods, ideas, or system contributions that make the work stand out.

A reproducible framework for analyzing ViT perceptual alignment with human judgments
Evidence that larger models exhibit lower perceptual alignment
Evidence that stronger data augmentation and regularization decrease alignment further