Mobile U-ViT: Revisiting large kernel and U-shaped ViT for efficient medical image segmentation

📅 2025-08-01
📈 Citations: 0
Influential: 0
🤖 AI Summary
Medical images exhibit low information density yet high semantic complexity, limiting the performance of existing lightweight models, which are primarily designed for natural images, on mobile segmentation tasks. To address this, we propose a mobile-optimized lightweight U-shaped vision transformer architecture. Our method integrates a CNN–Transformer hybrid design featuring hierarchical large-kernel convolutional patch embedding (ConvUtr), a local–global–local (LGL) attention mechanism, inverted bottleneck fusion, and down-sampling skip connections. It further incorporates a shallow Transformer bottleneck, a cascaded decoder, and parameter-efficient large-kernel convolutions to support 2D and 3D medical image segmentation across diverse modalities. Evaluated on eight public benchmarks, the model achieves state-of-the-art (SOTA) performance and demonstrates superior zero-shot transfer on four unseen datasets, significantly outperforming existing mobile-friendly models.

📝 Abstract
In clinical practice, medical image analysis often requires efficient execution on resource-constrained mobile devices. However, existing mobile models, primarily optimized for natural images, tend to perform poorly on medical tasks due to the significant information density gap between the natural and medical domains. Combining computational efficiency with architectural advantages specific to medical imaging remains a challenge when developing lightweight, universal, and high-performing networks. To address this, we propose a mobile model called Mobile U-shaped Vision Transformer (Mobile U-ViT) tailored for medical image segmentation. Specifically, we employ the newly proposed ConvUtr as a hierarchical patch embedding, featuring a parameter-efficient large-kernel CNN with inverted bottleneck fusion. This design exhibits transformer-like representation learning capacity while being lighter and faster. To enable efficient local-global information exchange, we introduce a novel Large-kernel Local-Global-Local (LGL) block that effectively balances the low information density and high-level semantic discrepancy of medical images. Finally, we incorporate a shallow and lightweight transformer bottleneck for long-range modeling and employ a cascaded decoder with downsample skip connections for dense prediction. Despite its reduced computational demands, our medical-optimized architecture achieves state-of-the-art performance across eight public 2D and 3D datasets covering diverse imaging modalities, including zero-shot testing on four unseen datasets. These results establish it as an efficient, powerful, and generalizable solution for mobile medical image analysis. Code is available at https://github.com/FengheTan9/Mobile-U-ViT.
Problem

Research questions and friction points this paper is trying to address.

Efficient medical image segmentation on mobile devices
Bridging information density gap in medical vs natural images
Balancing computational efficiency with medical-specific architecture
Innovation

Methods, ideas, or system contributions that make the work stand out.

ConvUtr for hierarchical patch embedding
Large-kernel LGL block for information exchange
Lightweight transformer bottleneck for long-range modeling
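The "parameter-efficient large-kernel convolutions" highlighted above typically rely on a depthwise-separable factorization. A minimal back-of-envelope sketch, assuming a depthwise 7×7 kernel wrapped in pointwise (inverted-bottleneck style) projections; the exact ConvUtr layout is not specified here, so these functions and numbers are illustrative only:

```python
# Back-of-envelope parameter counts (bias terms omitted) showing why
# large-kernel convolutions can stay lightweight when made depthwise-separable.
# NOTE: the exact ConvUtr design is an assumption here, not the paper's spec.

def standard_conv_params(c_in: int, c_out: int, k: int) -> int:
    """Dense k x k convolution: every output channel sees every input channel."""
    return c_in * c_out * k * k

def depthwise_separable_params(c: int, k: int, expand: int = 1) -> int:
    """1x1 expand -> depthwise k x k -> 1x1 project (inverted bottleneck)."""
    hidden = c * expand
    return (c * hidden            # 1x1 expansion
            + hidden * k * k      # depthwise large-kernel conv
            + hidden * c)         # 1x1 projection

C, K = 64, 7  # hypothetical channel width and kernel size
dense = standard_conv_params(C, C, K)
separable = depthwise_separable_params(C, K, expand=1)
print(dense, separable, round(dense / separable, 1))  # prints: 200704 11328 17.7
```

At C = 64, the factorized form needs roughly 18× fewer weights than a dense 7×7 convolution, which is the usual motivation for pairing large kernels with depthwise separability on mobile hardware.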
Fenghe Tang
University of Science and Technology of China
Medical Image Analysis · Foundation model
Bingkun Nian
School of Automation and Intelligent Sensing, Shanghai Jiao Tong University; Institute of Image Processing and Pattern Recognition, Shanghai Jiao Tong University; Institute of Medical Robotics, Shanghai Jiao Tong University
Jianrui Ding
School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China
Wenxin Ma
University of Science and Technology of China
AI · computer vision
Quan Quan
State Grid Hunan Electric Power Corporation Limited Research Institute
Chengqi Dong
School of Biomedical Engineering, Division of Life Sciences and Medicine, University of Science and Technology of China, Hefei, Anhui, 230026, P.R. China; Center for Medical Imaging, Robotics, Analytic Computing & Learning (MIRACLE), Suzhou Institute for Advanced Research, USTC, 215123, P.R. China
Jie Yang
School of Automation and Intelligent Sensing, Shanghai Jiao Tong University; Institute of Image Processing and Pattern Recognition, Shanghai Jiao Tong University; Institute of Medical Robotics, Shanghai Jiao Tong University
Wei Liu
School of Automation and Intelligent Sensing, Shanghai Jiao Tong University; Institute of Image Processing and Pattern Recognition, Shanghai Jiao Tong University; Institute of Medical Robotics, Shanghai Jiao Tong University
S. Kevin Zhou
School of Biomedical Engineering, Division of Life Sciences and Medicine, University of Science and Technology of China, Hefei, Anhui, 230026, P.R. China; Center for Medical Imaging, Robotics, Analytic Computing & Learning (MIRACLE), Suzhou Institute for Advanced Research, USTC, 215123, P.R. China