Image Recognition with Online Lightweight Vision Transformer: A Survey

πŸ“… 2025-05-06
πŸ€– AI Summary
This paper addresses the high computational cost and memory footprint of Vision Transformers (ViTs) in image classification through a systematic survey of online lightweighting strategies. It organizes the field into three synergistic directions: (1) efficient component design, including attention mechanisms and modular pruning; (2) input-adaptive dynamic networks; and (3) knowledge distillation. Representative methods from each direction are compared under a unified evaluation on the ImageNet-1K benchmark, quantifying the trade-offs among accuracy, parameter count, throughput, and memory usage. The survey identifies dynamic sparsification and collaborative distillation as critical frontiers for lightweight ViT research and releases open-source resources to support reproducibility.

πŸ“ Abstract
The Transformer architecture has achieved significant success in natural language processing, motivating its adaptation to computer vision tasks. Unlike convolutional neural networks, vision transformers inherently capture long-range dependencies and enable parallel processing, yet they lack convolutional inductive biases and efficiency, facing significant computational and memory challenges that limit their real-world applicability. This paper surveys online strategies for generating lightweight vision transformers for image recognition, focusing on three key areas: Efficient Component Design, Dynamic Networks, and Knowledge Distillation. We evaluate representative methods for each topic on the ImageNet-1K benchmark, analyzing trade-offs among accuracy, parameter count, throughput, and other metrics to highlight their respective advantages, disadvantages, and flexibility. Finally, we propose future research directions and potential challenges in the lightweighting of vision transformers, aiming to inspire further exploration and provide practical guidance for the community. Project Page: https://github.com/ajxklo/Lightweight-VIT
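One recurring idea behind the dynamic-network strategies the abstract mentions is to spend compute only on the most informative patch tokens of each image. As a minimal illustrative sketch (a generic formulation, not any specific method from the survey; the function name and scoring rule are assumptions), pruning patch tokens by their [CLS]-attention score might look like:

```python
import numpy as np

def prune_tokens(tokens: np.ndarray, cls_attn: np.ndarray,
                 keep_ratio: float = 0.5) -> np.ndarray:
    """Keep the highest-scoring patch tokens, ranked by [CLS] attention.

    tokens:   (N, D) patch token embeddings
    cls_attn: (N,) attention weights from the [CLS] token to each patch
    """
    k = max(1, int(round(keep_ratio * tokens.shape[0])))
    keep = np.argsort(cls_attn)[::-1][:k]   # indices of the top-k tokens
    return tokens[np.sort(keep)]            # preserve original spatial order

# Toy example: 8 tokens of dimension 4, half of them kept.
rng = np.random.default_rng(0)
toks = rng.standard_normal((8, 4))
attn = rng.random(8)
pruned = prune_tokens(toks, attn, keep_ratio=0.5)
```

In practice such scores come from the attention maps of an intermediate ViT block, and the keep ratio can itself be predicted per input, which is what makes the computation "dynamic".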
Problem

Research questions and friction points this paper is trying to address.

Adapting the Transformer architecture to image recognition without prohibitive cost
Reducing the computational and memory overhead of vision transformers
Surveying lightweight ViT strategies for real-world deployment
Innovation

Methods, ideas, or system contributions that make the work stand out.

Efficient Component Design for lightweight vision transformers
Dynamic Network strategies for adaptive processing
Knowledge Distillation to enhance model efficiency
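Of the three directions above, knowledge distillation is the most self-contained to illustrate: a small student ViT is trained to match both the ground-truth labels and the softened predictions of a larger teacher. A minimal sketch of the standard soft-target distillation loss (a generic textbook formulation, not the multi-stage scheme of any surveyed paper; all names and default values here are illustrative):

```python
import numpy as np

def softmax(z: np.ndarray, t: float = 1.0) -> np.ndarray:
    """Temperature-scaled softmax, numerically stabilized."""
    z = z / t
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits: np.ndarray,
                      teacher_logits: np.ndarray,
                      labels: np.ndarray,
                      t: float = 2.0, alpha: float = 0.5) -> float:
    """Blend hard-label cross-entropy with soft-target KL divergence.

    The t**2 factor keeps soft-target gradients on the same scale as
    the hard-label term when the temperature changes.
    """
    p_s = softmax(student_logits, t)
    p_t = softmax(teacher_logits, t)
    kl = np.sum(p_t * (np.log(p_t + 1e-12) - np.log(p_s + 1e-12)),
                axis=-1).mean()
    hard = softmax(student_logits)
    ce = -np.log(hard[np.arange(len(labels)), labels] + 1e-12).mean()
    return alpha * ce + (1 - alpha) * (t ** 2) * kl
```

A higher temperature flattens the teacher distribution so the student also learns the relative similarities among wrong classes, which is where much of the distillation signal comes from.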
Zherui Zhang
Beijing University of Posts and Telecommunications, China
Rongtao Xu
MBZUAI (formerly CASIA, HUST)
Intelligent Robots · Embodied AI · VLA · VLM · Spatiotemporal AI
Jie Zhou
Beijing University of Posts and Telecommunications, China
Changwei Wang
Shandong Computer Science Center
Multimodal Learning · Embodied AI · Edge Intelligent Computing · AI for Healthcare · Safety Alignment
Xingtian Pei
Beijing University of Posts and Telecommunications, China
Wenhao Xu
Unknown affiliation
Jiguang Zhang
State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences, China
Li Guo
Beijing University of Posts and Telecommunications, China
Longxiang Gao
Professor, Qilu University of Technology; Adjunct Professor, University of Southern Queensland
Edge AI · Federated Learning · Machine Learning · Quantum Computing
Wenbo Xu
Sun Yat-sen University
Multimodal · Multimedia
Shibiao Xu
Beijing University of Posts and Telecommunications
Computer Vision · Machine Learning · Computer Graphics