🤖 AI Summary
Medical images exhibit low information density yet high semantic complexity, which limits the performance of existing lightweight models, designed primarily for natural images, on mobile segmentation tasks. To address this, we propose Mobile U-ViT, a lightweight U-shaped vision transformer optimized for mobile deployment. The architecture is a CNN–Transformer hybrid: a hierarchical patch embedding (ConvUtr) built from parameter-efficient large-kernel convolutions with inverted bottleneck fusion, a Local-Global-Local (LGL) attention block for efficient information exchange, a shallow transformer bottleneck for long-range modeling, and a cascaded decoder with downsample skip connections, together supporting multimodal 2D/3D medical image segmentation. Evaluated on eight public benchmarks, the model achieves state-of-the-art (SOTA) performance and shows superior zero-shot transfer on four unseen datasets, significantly outperforming existing mobile-friendly models. A rough sketch of the large-kernel embedding idea follows below.
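To make the "large-kernel convolution with inverted bottleneck fusion" idea concrete, here is a minimal PyTorch sketch of one hierarchical embedding stage. The module name `ConvUtrStage`, the kernel size, and the expansion ratio are illustrative assumptions, not the authors' exact implementation; see the linked repository for the real design.

```python
# Sketch of a large-kernel, inverted-bottleneck embedding stage.
# Module name, kernel size (7), and expansion ratio (4) are assumptions.
import torch
import torch.nn as nn

class ConvUtrStage(nn.Module):
    def __init__(self, in_ch: int, out_ch: int, kernel_size: int = 7, expand: int = 4):
        super().__init__()
        # Depthwise large-kernel conv: wide receptive field at low parameter cost.
        self.dw = nn.Conv2d(in_ch, in_ch, kernel_size,
                            padding=kernel_size // 2, groups=in_ch)
        self.norm = nn.BatchNorm2d(in_ch)
        # Inverted bottleneck: expand channels, apply nonlinearity, project back.
        self.pw1 = nn.Conv2d(in_ch, in_ch * expand, 1)
        self.act = nn.GELU()
        self.pw2 = nn.Conv2d(in_ch * expand, in_ch, 1)
        # Strided conv halves spatial resolution, building the hierarchy.
        self.down = nn.Conv2d(in_ch, out_ch, 2, stride=2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Residual mixing, then downsample to the next hierarchy level.
        x = x + self.pw2(self.act(self.pw1(self.norm(self.dw(x)))))
        return self.down(x)

# Usage: embed a feature map from a 256x256 scan into a coarser level.
x = torch.randn(1, 16, 256, 256)
print(ConvUtrStage(16, 32)(x).shape)  # torch.Size([1, 32, 128, 128])
```

The depthwise large kernel keeps parameters roughly linear in channel count, which is what makes this transformer-like receptive field affordable on mobile hardware.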
📝 Abstract
In clinical practice, medical image analysis often must run efficiently on resource-constrained mobile devices. However, existing mobile models, optimized primarily for natural images, tend to perform poorly on medical tasks due to the significant information-density gap between the natural and medical domains. Combining computational efficiency with architectural advantages specific to medical imaging remains a challenge when developing lightweight, universal, and high-performing networks. To address this, we propose the Mobile U-shaped Vision Transformer (Mobile U-ViT), a mobile model tailored for medical image segmentation. Specifically, we employ the newly proposed ConvUtr as a hierarchical patch embedding, featuring a parameter-efficient large-kernel CNN with inverted bottleneck fusion. This design exhibits transformer-like representation learning capacity while being lighter and faster. To enable efficient local-global information exchange, we introduce a novel Large-kernel Local-Global-Local (LGL) block that effectively balances the low information density and high-level semantic discrepancy of medical images. Finally, we incorporate a shallow, lightweight transformer bottleneck for long-range modeling and a cascaded decoder with downsample skip connections for dense prediction. Despite its reduced computational demands, our medical-optimized architecture achieves state-of-the-art performance across eight public 2D and 3D datasets covering diverse imaging modalities, including zero-shot testing on four unseen datasets. These results establish Mobile U-ViT as an efficient, powerful, and generalizable solution for mobile medical image analysis. Code is available at https://github.com/FengheTan9/Mobile-U-ViT.
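The local-global-local pattern described above can be illustrated with a hedged sketch: local depthwise mixing, global self-attention over pooled tokens (cheap because the token grid is shrunk first), then local refinement. The module name `LGLBlock`, the pooling factor, and the head count are assumptions for illustration, not the paper's exact block.

```python
# Hedged sketch of a Local-Global-Local (LGL) exchange. Pooling factor,
# head count, and module name are illustrative assumptions.
import torch
import torch.nn as nn

class LGLBlock(nn.Module):
    def __init__(self, dim: int, heads: int = 4, pool: int = 4, kernel_size: int = 7):
        super().__init__()
        self.pool_factor = pool
        self.local_in = nn.Conv2d(dim, dim, kernel_size,
                                  padding=kernel_size // 2, groups=dim)
        self.pool = nn.AvgPool2d(pool)   # shrink the token grid before attention
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.up = nn.Upsample(scale_factor=pool, mode="nearest")
        self.local_out = nn.Conv2d(dim, dim, kernel_size,
                                   padding=kernel_size // 2, groups=dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Assumes H and W are divisible by the pooling factor.
        x = x + self.local_in(x)                         # local: neighborhood mixing
        b, c, h, w = x.shape
        ph, pw = h // self.pool_factor, w // self.pool_factor
        t = self.pool(x).flatten(2).transpose(1, 2)      # (B, ph*pw, C) tokens
        g, _ = self.attn(t, t, t)                        # global: full attention on pooled grid
        g = g.transpose(1, 2).reshape(b, c, ph, pw)
        x = x + self.up(g)                               # broadcast global context back
        return x + self.local_out(x)                     # local: refinement

# Usage: one LGL pass over a mid-level feature map.
x = torch.randn(1, 32, 64, 64)
print(LGLBlock(32)(x).shape)  # torch.Size([1, 32, 64, 64])
```

Pooling before attention reduces the quadratic attention cost by the fourth power of the pooling factor, which is one plausible way a block like this stays mobile-friendly while still modeling global context.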