ODE-ViT: Plug & Play Attention Layer from the Generalization of the ViT as an Ordinary Differential Equation

📅 2025-11-20
📈 Citations: 0
Influential: 0
🤖 AI Summary
Large vision models such as ViT incur high computational cost, offer limited interpretability, and rely on discrete attention mechanisms. To address these limitations, we propose ODE-ViT: a continuous-time reformulation of ViT as a well-posed, stable ordinary differential equation (ODE) system in which attention dynamics evolve smoothly rather than in discrete steps. We further introduce a plug-and-play teacher–student framework that treats the intermediate representations of a discrete ViT teacher as solutions of the ODE, guiding the continuous trajectory during training. Evaluated on CIFAR-10 and CIFAR-100, ODE-ViT achieves competitive accuracy with up to an order of magnitude fewer parameters and surpasses prior ODE-based Transformers in classification, while the teacher–student scheme improves performance by more than 10% over training a free ODE-ViT from scratch. Our approach enables efficient, compact, and more interpretable visual representation learning through principled continuous dynamics.

📝 Abstract
In recent years, increasingly large models have achieved outstanding performance across CV tasks. However, these models demand substantial computational resources and storage, and their growing complexity limits our understanding of how they make decisions. Most of these architectures rely on the attention mechanism within Transformer-based designs. Building upon the connection between residual neural networks and ordinary differential equations (ODEs), we introduce ODE-ViT, a Vision Transformer reformulated as an ODE system that satisfies the conditions for well-posed and stable dynamics. Experiments on CIFAR-10 and CIFAR-100 demonstrate that ODE-ViT achieves stable, interpretable, and competitive performance with up to one order of magnitude fewer parameters, surpassing prior ODE-based Transformer approaches in classification tasks. We further propose a plug-and-play teacher-student framework in which a discrete ViT guides the continuous trajectory of ODE-ViT by treating the intermediate representations of the teacher as solutions of the ODE. This strategy improves performance by more than 10% compared to training a free ODE-ViT from scratch.
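The residual-network/ODE connection the abstract builds on can be made concrete: a stack of residual attention blocks x_{l+1} = x_l + h·f(x_l) is exactly a forward-Euler discretization of dx/dt = f(x), so one shared attention block integrated over time replaces many distinct layers. The sketch below illustrates this idea only; the single-head attention, weight matrices, and unit time interval are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_field(x, Wq, Wk, Wv):
    """Single-head self-attention used as the ODE vector field f(x).
    x: (tokens, dim). The weight matrices are illustrative stand-ins
    for the (shared) attention parameters of the continuous model."""
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = softmax(q @ k.T / np.sqrt(x.shape[-1]))
    return scores @ v

def ode_vit_forward(x0, Wq, Wk, Wv, depth=12):
    """Forward-Euler integration of dx/dt = f(x) on t in [0, 1].
    With step size h = 1/depth this reproduces the residual update
    x_{l+1} = x_l + h * f(x_l) that links ResNets/ViTs to ODEs."""
    h = 1.0 / depth
    x = x0
    for _ in range(depth):
        x = x + h * attention_field(x, Wq, Wk, Wv)
    return x
```

Because the same vector field is reused at every integration step, parameter count no longer grows with depth, which is consistent with the order-of-magnitude reduction reported in the abstract.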
Problem

Research questions and friction points this paper is trying to address.

Reducing computational demands of large Vision Transformer models
Enhancing model interpretability through ODE-based reformulation
Improving performance with fewer parameters using teacher-student framework
Innovation

Methods, ideas, or system contributions that make the work stand out.

Reformulates Vision Transformer as ODE system
Uses plug-and-play teacher-student framework
Achieves competitive performance with fewer parameters
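The teacher–student idea above can be sketched as trajectory matching: the teacher ViT's layer-l output is treated as the ODE solution at time t_l = l/L, and the student's integrated trajectory is penalized for deviating from those anchor states. This is a minimal numpy sketch under the simplifying assumption of one teacher state per Euler step and a plain MSE penalty; the paper's exact loss and solver are not specified here.

```python
import numpy as np

def euler_trajectory(x0, f, steps):
    """Integrate dx/dt = f(x) with forward Euler on [0, 1],
    recording the state after every step (the student trajectory)."""
    h = 1.0 / steps
    traj = [x0]
    x = x0
    for _ in range(steps):
        x = x + h * f(x)
        traj.append(x)
    return traj

def trajectory_distillation_loss(student_traj, teacher_states):
    """Mean squared error between the student's ODE trajectory and the
    teacher ViT's intermediate representations, which are treated as
    samples of the true solution at times t_l = l/L (a simplification)."""
    return np.mean([np.mean((s - t) ** 2)
                    for s, t in zip(student_traj, teacher_states)])
```

Supervising the whole trajectory, rather than only the final state, is what lets the discrete teacher guide the continuous dynamics throughout integration.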
Carlos Boned Riera
Computer Vision Center (CVC), Universitat Autònoma de Barcelona
David Romero Sanchez
Mathematical Research Center (CRM), Universitat Autònoma de Barcelona
Oriol Ramos Terrades
Dep. Ciències de la Computació, Universitat Autònoma de Barcelona - Computer Vision Centre
machine learning · computer vision · pattern recognition