🤖 AI Summary
This work addresses the limited accessibility of existing interpretability methods for Vision Transformers (ViTs), which often focus on isolated components or cater primarily to experts, thereby lacking intuitive tools for understanding end-to-end reasoning. To bridge this gap, the authors propose ViT-Explainer, a web-based interactive visualization system that integrates animated demonstrations, patch-level attention heatmaps, and a visually adapted Logit Lens, collectively illustrating the full inference pipeline, from image tokenization to final classification. The system supports both guided learning and free exploration modes, lowering the barrier to comprehension for non-expert users. A small user study with six participants suggests that the tool is easy to learn and use and helps users understand ViT internal mechanisms.
📝 Abstract
Transformer-based architectures have become the shared backbone of natural language processing and computer vision. However, understanding how these models operate remains challenging, particularly in vision settings, where images are processed as sequences of patch tokens. Existing interpretability tools often focus on isolated components or expert-oriented analysis, leaving a gap in guided, end-to-end understanding of the full inference pipeline. To bridge this gap, we present ViT-Explainer, a web-based interactive system that provides an integrated visualization of Vision Transformer inference, from patch tokenization to final classification. The system combines animated walkthroughs, patch-level attention overlays, and a vision-adapted Logit Lens within both guided and free exploration modes. A user study with six participants suggests that ViT-Explainer is easy to learn and use, helping users interpret and understand Vision Transformer behavior.
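Two of the ideas the abstract names, patch tokenization and a vision-adapted Logit Lens, can be sketched in a few lines. This is an illustrative sketch, not the authors' implementation: the function names, shapes, and the random toy data are all assumptions made here for demonstration. The Logit Lens idea is simply to project each layer's [CLS] hidden state through the final classifier head to watch the class prediction form layer by layer.

```python
import numpy as np

def patchify(image, patch=4):
    """Split an H x W x C image into flattened non-overlapping patch tokens.

    This mirrors the first step of ViT inference: each patch becomes one
    token of dimension patch * patch * C (before the learned projection).
    """
    H, W, C = image.shape
    rows, cols = H // patch, W // patch
    tokens = (image[:rows * patch, :cols * patch]
              .reshape(rows, patch, cols, patch, C)
              .transpose(0, 2, 1, 3, 4)          # group pixels by patch
              .reshape(rows * cols, patch * patch * C))
    return tokens  # shape: (num_patches, patch_dim)

def logit_lens(cls_states, W_head, b_head):
    """Vision Logit Lens sketch: apply the final classification head to the
    [CLS] hidden state of every layer, yielding per-layer class logits."""
    return [h @ W_head + b_head for h in cls_states]

# Toy demonstration with random data (shapes are illustrative assumptions).
rng = np.random.default_rng(0)
img = rng.random((8, 8, 3))                # tiny 8x8 RGB "image"
tokens = patchify(img, patch=4)
print(tokens.shape)                        # (4, 48): 2x2 patches of 4*4*3 values

cls_states = [rng.random(16) for _ in range(3)]   # [CLS] state at 3 layers
W, b = rng.random((16, 10)), np.zeros(10)         # 10-class head
per_layer_logits = logit_lens(cls_states, W, b)
print(len(per_layer_logits), per_layer_logits[0].shape)  # 3 (10,)
```

In the actual system, the per-layer logits would be rendered as an evolving class-probability view, and the attention overlays would come from the model's attention weights rather than anything shown here.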