Scaling White-Box Transformers for Vision

πŸ“… 2024-05-30
πŸ›οΈ arXiv.org
πŸ“ˆ Citations: 3
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
This study addresses the scalability bottleneck of white-box vision transformers (CRATE), investigating whether scaling up such models preserves mathematical transparency while improving performance. The authors propose CRATE-α, a lightly modified architecture paired with an improved training strategy; its core innovation is a redesigned sparse coding block that allows model capacity and data volume to scale jointly while enhancing unsupervised semantic segmentation capability and interpretability. To the authors' knowledge, this is the first systematic validation of a scalable design path for CRATE, integrating scalable token representation learning with visualization-based analysis. Experiments demonstrate that CRATE-α-B achieves 83.2% top-1 accuracy on ImageNet (+3.7% over the prior best CRATE-B), and CRATE-α-L reaches 85.1%. Larger variants also significantly improve unsupervised image segmentation quality, confirming that high performance and white-box transparency can be attained simultaneously.

πŸ“ Abstract
CRATE, a white-box transformer architecture designed to learn compressed and sparse representations, offers an intriguing alternative to standard vision transformers (ViTs) due to its inherent mathematical interpretability. Despite extensive investigations into the scaling behaviors of language and vision transformers, the scalability of CRATE remains an open question which this paper aims to address. Specifically, we propose CRATE-α, featuring strategic yet minimal modifications to the sparse coding block in the CRATE architecture design, and a light training recipe designed to improve the scalability of CRATE. Through extensive experiments, we demonstrate that CRATE-α can effectively scale with larger model sizes and datasets. For example, our CRATE-α-B substantially outperforms the prior best CRATE-B model accuracy on ImageNet classification by 3.7%, achieving an accuracy of 83.2%. Meanwhile, when scaling further, our CRATE-α-L obtains an ImageNet classification accuracy of 85.1%. More notably, these model performance improvements are achieved while preserving, and potentially even enhancing, the interpretability of learned CRATE models, as we demonstrate through showing that the learned token representations of increasingly larger trained CRATE-α models yield increasingly higher-quality unsupervised object segmentation of images. The project page is https://rayjryang.github.io/CRATE-alpha/.
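The sparse coding block the abstract refers to is, at its heart, an unrolled sparse coding update in the ISTA (Iterative Shrinkage-Thresholding Algorithm) style. As a rough illustrative sketch only, not the paper's exact CRATE-α block: the dictionary `D`, step size, and threshold below are hypothetical toy values, one such update looks like:

```python
import numpy as np

def soft_threshold(x, lam):
    # Proximal operator of the L1 norm: shrink each entry toward zero by lam.
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

def ista_step(a, x, D, step, lam):
    # One ISTA update for min_a 0.5*||x - D @ a||^2 + lam*||a||_1:
    # a gradient step on the reconstruction term, then soft-thresholding.
    grad = D.T @ (D @ a - x)
    return soft_threshold(a - step * grad, step * lam)

# Toy demo: sparse-code a random signal against a random dictionary.
rng = np.random.default_rng(0)
d, n = 8, 16                       # signal dim, number of atoms (hypothetical)
D = rng.standard_normal((d, n))
D /= np.linalg.norm(D, axis=0)     # unit-norm dictionary atoms
x = rng.standard_normal(d)

a = np.zeros(n)
for _ in range(100):
    a = ista_step(a, x, D, step=0.05, lam=0.1)

objective = 0.5 * np.sum((x - D @ a) ** 2) + 0.1 * np.abs(a).sum()
```

In CRATE this kind of update is unrolled into a network layer with a learned dictionary; CRATE-α's contribution, per the abstract, is a minimal redesign of how that block is parameterized so it scales with model size and data.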
Problem

Research questions and friction points this paper is trying to address: CRATE model, performance enhancement, visual tasks.
Innovation

Methods, ideas, or system contributions that make the work stand out: CRATE-α, image recognition accuracy, transparency.