AI Summary
To address low segmentation accuracy, poor robustness to small structures, and slow convergence due to class imbalance in 3D medical image segmentation, this paper proposes nnY-Net, a novel U-shaped architecture. It integrates Swin Transformer and ConvNeXt into a unified encoder-decoder backbone for the first time and introduces a cross-modal cross-attention module at the bottleneck layer to form a Y-shaped structure, where patient-level clinical semantics (e.g., pathology, treatment) serve as queries to dynamically modulate low-level features. We further design DiceFocalCELoss to mitigate voxel-level class imbalance. The model adopts a lightweight preprocessing pipeline inspired by nnU-Net/dynUNet. On multiple 3D medical segmentation benchmarks, nnY-Net consistently outperforms SwinUNETR and MedNeXt, achieving an average Dice score improvement of 1.8–3.2%, enhanced robustness for small-structure segmentation, and 22% faster training convergence.
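The cross-modal cross-attention described above, with patient-level clinical embeddings as queries and bottleneck voxel features as keys/values, can be sketched in a minimal, framework-free form. This is not the paper's implementation: the function names, tensor shapes, and projection matrices (`Wq`, `Wk`, `Wv`) are assumptions for illustration only.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def clinical_cross_attention(feat, clin_query, Wq, Wk, Wv):
    """Hypothetical sketch of the bottleneck cross-attention.

    feat:       (N, C)  flattened lowest-level encoder feature map (N voxels)
    clin_query: (1, Dc) patient-level clinical embedding (pathology, treatment)
    Wq/Wk/Wv:   learned projections (shapes are assumptions, not the paper's)
    """
    q = clin_query @ Wq                # (1, d): clinical semantics as Query
    k = feat @ Wk                      # (N, d): image features as Key
    v = feat @ Wv                      # (N, d): image features as Value
    # Scaled dot-product attention: which voxels matter for this patient?
    w = softmax(q @ k.T / np.sqrt(q.shape[-1]))   # (1, N) weights over voxels
    return w @ v                       # (1, d) clinically conditioned summary
```

In the full model this summary would be fused back into the decoder path; here it only demonstrates how a single clinical query attends over all bottleneck voxels.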
Abstract
This paper presents nnY-Net, a novel 3D medical image segmentation architecture. The name reflects the cross-attention module added at the bottom of the U-Net structure, which turns the U shape into a Y. We combine the strengths of two recent SOTA models, MedNeXt and SwinUNETR, using a Swin Transformer encoder and a ConvNeXt decoder in an innovative design we call Swin-NeXt. In the cross-attention module, the encoder's lowest-level feature map supplies the Keys and Values, while patient features such as pathology and treatment information serve as the Query for computing the attention weights. We also simplify the pre- and post-processing and data augmentation of 3D image segmentation based on the dynUNet and nnU-Net frameworks, and integrate the proposed Swin-NeXt with cross-attention into this pipeline. Finally, we construct a DiceFocalCELoss to improve training efficiency under the class-imbalanced convergence of voxel classification.
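The DiceFocalCELoss named above combines three standard terms: soft Dice (region overlap), focal loss (down-weighting easy voxels), and cross-entropy. The paper does not give its formulation, so the following is a hedged sketch; the term weights, `gamma`, and the simple unweighted sum are assumptions.

```python
import numpy as np

def dice_focal_ce_loss(probs, target, gamma=2.0, eps=1e-6,
                       w_dice=1.0, w_focal=1.0, w_ce=1.0):
    """Sketch of a combined Dice + Focal + CE loss for voxel classification.

    probs:  (N, K) softmax class probabilities for N voxels, K classes
    target: (N,)   integer class id per voxel
    gamma, w_*: hypothetical hyperparameters, not the paper's values
    """
    N, K = probs.shape
    onehot = np.eye(K)[target]                      # (N, K) one-hot targets
    pt = probs[np.arange(N), target]                # probability of true class
    ce = -np.log(pt + eps)                          # cross-entropy per voxel
    focal = ((1.0 - pt) ** gamma) * ce              # focal: suppress easy voxels
    inter = (probs * onehot).sum(axis=0)            # per-class soft intersection
    dice = 1.0 - (2.0 * inter + eps) / (probs.sum(axis=0) + onehot.sum(axis=0) + eps)
    return w_dice * dice.mean() + w_focal * focal.mean() + w_ce * ce.mean()
```

The focal term is what targets the uneven convergence of voxel classes: confidently classified (typically background) voxels contribute almost nothing, so gradients concentrate on rare, hard structures.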