🤖 AI Summary
Medical images exhibit low information density yet high semantic complexity, which limits the performance of existing lightweight models, designed primarily for natural images, on mobile segmentation tasks. To address this, we propose Mobile U-ViT, a lightweight U-shaped vision transformer optimized for mobile deployment. The architecture is a CNN–Transformer hybrid: a hierarchical patch embedding (ConvUtr) built from parameter-efficient large-kernel convolutions with inverted bottleneck fusion, a Local-Global-Local (LGL) attention block for efficient information exchange, a shallow transformer bottleneck for long-range modeling, and a cascaded decoder with downsample skip connections, together supporting multimodal 2D/3D medical image segmentation. Evaluated on eight public benchmarks, the model achieves state-of-the-art (SOTA) performance and shows superior zero-shot transfer on four unseen datasets, significantly outperforming existing mobile-friendly models. A rough sketch of the large-kernel embedding idea follows below.
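To make the "large-kernel convolution with inverted bottleneck fusion" idea concrete, here is a minimal PyTorch sketch of one hierarchical embedding stage. The module name `ConvUtrStage`, the kernel size, and the expansion ratio are illustrative assumptions, not the authors' exact implementation; see the linked repository for the real design.

```python
# Sketch of a large-kernel, inverted-bottleneck embedding stage.
# Module name, kernel size (7), and expansion ratio (4) are assumptions.
import torch
import torch.nn as nn

class ConvUtrStage(nn.Module):
    def __init__(self, in_ch: int, out_ch: int, kernel_size: int = 7, expand: int = 4):
        super().__init__()
        # Depthwise large-kernel conv: wide receptive field at low parameter cost.
        self.dw = nn.Conv2d(in_ch, in_ch, kernel_size,
                            padding=kernel_size // 2, groups=in_ch)
        self.norm = nn.BatchNorm2d(in_ch)
        # Inverted bottleneck: expand channels, apply nonlinearity, project back.
        self.pw1 = nn.Conv2d(in_ch, in_ch * expand, 1)
        self.act = nn.GELU()
        self.pw2 = nn.Conv2d(in_ch * expand, in_ch, 1)
        # Strided conv halves spatial resolution, building the hierarchy.
        self.down = nn.Conv2d(in_ch, out_ch, 2, stride=2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Residual mixing, then downsample to the next hierarchy level.
        x = x + self.pw2(self.act(self.pw1(self.norm(self.dw(x)))))
        return self.down(x)

# Usage: embed a feature map from a 256x256 scan into a coarser level.
x = torch.randn(1, 16, 256, 256)
print(ConvUtrStage(16, 32)(x).shape)  # torch.Size([1, 32, 128, 128])
```

The depthwise large kernel keeps parameters roughly linear in channel count, which is what makes this transformer-like receptive field affordable on mobile hardware.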
📝 Abstract
In clinical practice, medical image analysis often must run efficiently on resource-constrained mobile devices. However, existing mobile models, optimized primarily for natural images, tend to perform poorly on medical tasks due to the significant information-density gap between the natural and medical domains. Combining computational efficiency with architectural advantages specific to medical imaging remains a challenge when developing lightweight, universal, and high-performing networks. To address this, we propose the Mobile U-shaped Vision Transformer (Mobile U-ViT), a mobile model tailored for medical image segmentation. Specifically, we employ the newly proposed ConvUtr as a hierarchical patch embedding, featuring a parameter-efficient large-kernel CNN with inverted bottleneck fusion. This design exhibits transformer-like representation learning capacity while being lighter and faster. To enable efficient local-global information exchange, we introduce a novel Large-kernel Local-Global-Local (LGL) block that effectively balances the low information density and high-level semantic discrepancy of medical images. Finally, we incorporate a shallow, lightweight transformer bottleneck for long-range modeling and a cascaded decoder with downsample skip connections for dense prediction. Despite its reduced computational demands, our medical-optimized architecture achieves state-of-the-art performance across eight public 2D and 3D datasets covering diverse imaging modalities, including zero-shot testing on four unseen datasets. These results establish Mobile U-ViT as an efficient, powerful, and generalizable solution for mobile medical image analysis. Code is available at https://github.com/FengheTan9/Mobile-U-ViT.
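The local-global-local pattern described above can be illustrated with a hedged sketch: local depthwise mixing, global self-attention over pooled tokens (cheap because the token grid is shrunk first), then local refinement. The module name `LGLBlock`, the pooling factor, and the head count are assumptions for illustration, not the paper's exact block.

```python
# Hedged sketch of a Local-Global-Local (LGL) exchange. Pooling factor,
# head count, and module name are illustrative assumptions.
import torch
import torch.nn as nn

class LGLBlock(nn.Module):
    def __init__(self, dim: int, heads: int = 4, pool: int = 4, kernel_size: int = 7):
        super().__init__()
        self.pool_factor = pool
        self.local_in = nn.Conv2d(dim, dim, kernel_size,
                                  padding=kernel_size // 2, groups=dim)
        self.pool = nn.AvgPool2d(pool)   # shrink the token grid before attention
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.up = nn.Upsample(scale_factor=pool, mode="nearest")
        self.local_out = nn.Conv2d(dim, dim, kernel_size,
                                   padding=kernel_size // 2, groups=dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Assumes H and W are divisible by the pooling factor.
        x = x + self.local_in(x)                         # local: neighborhood mixing
        b, c, h, w = x.shape
        ph, pw = h // self.pool_factor, w // self.pool_factor
        t = self.pool(x).flatten(2).transpose(1, 2)      # (B, ph*pw, C) tokens
        g, _ = self.attn(t, t, t)                        # global: full attention on pooled grid
        g = g.transpose(1, 2).reshape(b, c, ph, pw)
        x = x + self.up(g)                               # broadcast global context back
        return x + self.local_out(x)                     # local: refinement

# Usage: one LGL pass over a mid-level feature map.
x = torch.randn(1, 32, 64, 64)
print(LGLBlock(32)(x).shape)  # torch.Size([1, 32, 64, 64])
```

Pooling before attention reduces the quadratic attention cost by the fourth power of the pooling factor, which is one plausible way a block like this stays mobile-friendly while still modeling global context.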