🤖 AI Summary
This study addresses the unclear alignment between Vision Transformers (ViTs) and human visual judgment in graphical perception tasks. For the first time, the classic graphical perception paradigm of Cleveland and McGill is adapted to evaluate ViTs, establishing a controlled benchmark that systematically compares ViTs, convolutional neural networks (CNNs), and human participants on elementary visual judgments. The findings reveal that, despite their strong performance on general vision tasks, ViTs deviate significantly from human judgments in graphical perception, exposing perceptual limitations in how they interpret visualizations. These results offer new insights both for model architecture design and for the application of deep learning models in visualization contexts.
📝 Abstract
Vision Transformers (ViTs) have emerged as a powerful alternative to convolutional neural networks (CNNs) across a variety of image-based tasks. While CNNs have previously been evaluated on graphical perception tasks, which are essential for interpreting visualizations, the perceptual capabilities of ViTs remain largely unexplored. In this work, we investigate the performance of ViTs on elementary visual judgment tasks inspired by the foundational studies of Cleveland and McGill, which quantified the accuracy of human perception across different visual encodings. Following their paradigm, we benchmark ViTs against CNNs and human participants in a series of controlled graphical perception tasks. Our results reveal that, although ViTs perform strongly on general vision tasks, their alignment with human graphical perception in the visualization domain is limited. This study highlights key perceptual gaps and points to important considerations for applying ViTs in visualization systems and graphical perception modeling.
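To make the evaluation setup concrete, the sketch below shows one way a Cleveland–McGill-style position judgment trial could be scored for a model: a bar-chart stimulus is rasterized, the model estimates the marked bar's height as a proportion of the tallest bar, and the estimate is scored with the log-absolute-error measure used in the classic graphical perception studies. This is a minimal illustration, not the authors' benchmark code; `render_bar_stimulus`, `predict_ratio`, and the trial parameters are hypothetical, and `predict_ratio` is a random stand-in for a ViT or CNN forward pass.

```python
# Minimal sketch of a Cleveland–McGill-style position judgment trial.
# NOT the paper's code: stimulus design, function names, and the model
# stub are illustrative assumptions.

import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless rendering
import matplotlib.pyplot as plt


def render_bar_stimulus(heights, marked, dpi=100):
    """Rasterize a simple bar chart to a grayscale image array."""
    fig, ax = plt.subplots(figsize=(2.24, 2.24), dpi=dpi)  # ~224x224 px
    ax.bar(range(len(heights)), heights, color="black")
    ax.plot(marked, heights[marked] + 3, "kv")  # triangle marks target bar
    ax.set_ylim(0, 100)
    ax.axis("off")
    fig.canvas.draw()
    img = np.asarray(fig.canvas.buffer_rgba())[..., :3].mean(axis=-1)
    plt.close(fig)
    return img / 255.0


def predict_ratio(image):
    """Hypothetical model call; replace with a ViT/CNN forward pass."""
    return np.random.uniform(0.1, 1.0)


def log_abs_error(true_pct, judged_pct):
    """Cleveland–McGill-style error: log2(|judged - true| + 1/8), in percent."""
    return np.log2(abs(judged_pct - true_pct) + 0.125)


rng = np.random.default_rng(0)
errors = []
for _ in range(20):
    heights = rng.uniform(10, 90, size=5)
    marked = int(rng.integers(0, 5))
    true_ratio = heights[marked] / heights.max()  # ground-truth proportion
    img = render_bar_stimulus(heights, marked)
    judged_ratio = predict_ratio(img)
    errors.append(log_abs_error(100 * true_ratio, 100 * judged_ratio))

print(f"mean log error over {len(errors)} trials: {np.mean(errors):.3f}")
```

Under this kind of setup, a lower mean log error indicates closer agreement with the ground-truth proportions, so model scores can be compared directly against human scores collected on the same stimuli.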