MicroViT: A Vision Transformer with Low Complexity Self Attention for Edge Device

📅 2025-02-09
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
To address the high computational overhead and deployment challenges of Vision Transformers (ViTs) on resource-constrained edge devices, this paper proposes MicroViT, a lightweight ViT architecture. Methodologically, it introduces two key innovations: (1) Efficient Single-Head Attention (ESHA), which integrates grouped convolutions with channel-wise sparsification to drastically reduce self-attention complexity; and (2) a multi-stage MetaFormer backbone designed to balance representational capacity with hardware efficiency. Evaluated on ImageNet-1K and COCO, MicroViT achieves competitive accuracy while accelerating inference by 3.6× over baseline ViTs. Moreover, it improves energy efficiency by 40% relative to the MobileViT series. These advances establish a new paradigm for efficient ViT deployment on edge platforms, enabling high-performance vision models under strict latency and power constraints.

πŸ“ Abstract
The Vision Transformer (ViT) has demonstrated state-of-the-art performance in various computer vision tasks, but its high computational demands make it impractical for edge devices with limited resources. This paper presents MicroViT, a lightweight Vision Transformer architecture optimized for edge devices, which significantly reduces computational complexity while maintaining high accuracy. The core of MicroViT is the Efficient Single Head Attention (ESHA) mechanism, which uses group convolution to reduce feature redundancy and processes only a fraction of the channels, lowering the cost of self-attention. MicroViT follows a multi-stage MetaFormer design, stacking multiple MicroViT encoders to balance efficiency and performance. Comprehensive experiments on the ImageNet-1K and COCO datasets show that MicroViT achieves competitive accuracy with 3.6× faster inference and 40% higher energy efficiency than the MobileViT series, making it suitable for deployment in resource-constrained environments such as mobile and edge devices.
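The core idea of ESHA described above, running single-head attention over only a fraction of the channels and passing the rest through untouched, can be sketched as follows. This is a minimal conceptual illustration, not the paper's implementation: the function name `esha`, the channel `ratio`, and the use of plain dense projections (standing in for the paper's group convolutions) are all assumptions made here for brevity.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def esha(x, wq, wk, wv, ratio=0.25):
    """Conceptual sketch of Efficient Single-Head Attention.

    x : (N, C) token features.
    Single-head attention is computed on only the first
    c = int(C * ratio) channels; the remaining channels are
    passed through unchanged, so the quadratic attention cost
    applies to a fraction of the feature dimension.
    (Dense projections here stand in for group convolutions.)
    """
    N, C = x.shape
    c = int(C * ratio)
    xa, xp = x[:, :c], x[:, c:]           # attended vs. pass-through channels
    q, k, v = xa @ wq, xa @ wk, xa @ wv   # single-head Q, K, V on the slice
    attn = softmax(q @ k.T / np.sqrt(c), axis=-1)  # (N, N) attention map
    out = attn @ v                        # (N, c) attended features
    return np.concatenate([out, xp], axis=1)  # back to (N, C)
```

With `ratio=0.25` the attention projections and the N×N score matrix operate on a quarter of the channels, which is the source of the complexity reduction the abstract refers to; the real architecture additionally uses grouped convolutions to cut redundancy in the projections themselves.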
Problem

Research questions and friction points this paper is trying to address.

Optimize Vision Transformer for edge devices
Reduce computational complexity in self-attention
Enhance efficiency and performance in resource-limited environments
Innovation

Methods, ideas, or system contributions that make the work stand out.

Efficient Single Head Attention
Group convolution reduces redundancy
Multi-stage MetaFormer architecture