🤖 AI Summary
Large vision models such as ViT carry high computational and storage costs, and their discrete stacks of attention layers are difficult to interpret. To address these limitations, we propose ODE-ViT: a reformulation of the Vision Transformer as a well-posed and stable ordinary differential equation (ODE) system, in which representations evolve smoothly in continuous time rather than through discrete layer steps. We further introduce a plug-and-play teacher–student framework that treats the intermediate representations of a discrete ViT teacher as solutions of the ODE, using them to guide the student's continuous trajectory. Evaluated on CIFAR-10 and CIFAR-100, ODE-ViT achieves stable, interpretable, and competitive classification performance with roughly an order of magnitude fewer parameters than the discrete ViT, surpassing prior ODE-based Transformers; teacher-guided training improves accuracy by more than 10% over training a free ODE-ViT from scratch. Our approach enables efficient, compact, and more interpretable visual representation learning through principled continuous dynamics.
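The teacher–student idea above can be sketched numerically. This is a hypothetical illustration, not the paper's code: the dynamics `f_student`, the integration scheme, and the squared-error trajectory loss are all assumptions. The teacher's intermediate layer outputs are treated as samples of the ODE solution at integer times t = 1, …, L, and the student is penalized for deviating from them.

```python
import numpy as np

def f_student(x, theta):
    # Toy continuous dynamics; a stand-in for the ODE-ViT block (assumption).
    return np.tanh(theta * x)

def student_trajectory(x0, theta, t_max, steps_per_unit=10):
    """Euler-integrate dx/dt = f_student(x) and record the state at each
    integer time t = 1, ..., t_max (one per teacher layer)."""
    x, h = x0.copy(), 1.0 / steps_per_unit
    states = []
    for _ in range(t_max):
        for _ in range(steps_per_unit):
            x = x + h * f_student(x, theta)
        states.append(x.copy())
    return states

def trajectory_loss(student_states, teacher_states):
    # Mean squared deviation between student ODE states and the teacher's
    # intermediate representations, matched layer by layer.
    return float(np.mean([np.sum((s - t) ** 2)
                          for s, t in zip(student_states, teacher_states)]))

# Fake "teacher" intermediate representations for a 3-layer teacher.
teacher = [np.array([0.5, 0.2]), np.array([0.8, 0.4]), np.array([1.0, 0.6])]
x0 = np.array([0.3, 0.1])
states = student_trajectory(x0, theta=0.5, t_max=3)
loss = trajectory_loss(states, teacher)
```

In an actual training loop, `loss` would be minimized over the student's parameters; here it only demonstrates how a continuous trajectory can be supervised by discrete layer outputs.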
📝 Abstract
In recent years, increasingly large models have achieved outstanding performance across computer vision tasks. However, these models demand substantial computational resources and storage, and their growing complexity limits our understanding of how they make decisions. Most of these architectures rely on the attention mechanism within Transformer-based designs. Building upon the connection between residual neural networks and ordinary differential equations (ODEs), we introduce ODE-ViT, a Vision Transformer reformulated as an ODE system that satisfies the conditions for well-posed and stable dynamics. Experiments on CIFAR-10 and CIFAR-100 demonstrate that ODE-ViT achieves stable, interpretable, and competitive performance with up to one order of magnitude fewer parameters, surpassing prior ODE-based Transformer approaches in classification tasks. We further propose a plug-and-play teacher–student framework in which a discrete ViT guides the continuous trajectory of ODE-ViT by treating the intermediate representations of the teacher as solutions of the ODE. This strategy improves performance by more than 10% compared to training a free ODE-ViT from scratch.
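The residual-network/ODE connection the abstract builds on can be made concrete with a minimal sketch (the block `f` below is a toy stand-in, not the paper's attention dynamics): a residual update x_{l+1} = x_l + f(x_l) is exactly one forward-Euler step of dx/dt = f(x) with step size h = 1, so a stack of residual layers traces a discretized ODE trajectory.

```python
import numpy as np

def f(x):
    """Toy stand-in for a Transformer block (attention + MLP); assumption."""
    return 0.1 * np.tanh(x)

def residual_forward(x, depth):
    # Discrete residual stack: one application of f per layer.
    for _ in range(depth):
        x = x + f(x)
    return x

def euler_ode_forward(x, t_end, n_steps):
    # Forward-Euler integration of dx/dt = f(x) on [0, t_end].
    h = t_end / n_steps
    for _ in range(n_steps):
        x = x + h * f(x)
    return x

x0 = np.array([1.0, -0.5])
# With h = 1, the ODE view reproduces the residual network exactly;
# taking more, smaller steps refines the same continuous trajectory.
out_residual = residual_forward(x0, depth=4)
out_euler = euler_ode_forward(x0, t_end=4.0, n_steps=4)
```

Shrinking h (increasing `n_steps` for the same `t_end`) is what turns the discrete layer stack into the continuous-time model, which is where ODE-ViT's parameter sharing and stability conditions come from.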