AI Summary
To address low segmentation accuracy, poor robustness to small structures, and slow convergence due to class imbalance in 3D medical image segmentation, this paper proposes nnY-Net, a novel U-shaped architecture. It integrates Swin Transformer and ConvNeXt into a unified encoder-decoder backbone for the first time and introduces a cross-modal cross-attention module at the bottleneck layer to form a Y-shaped structure, where patient-level clinical semantics (e.g., pathology, treatment) serve as queries to dynamically modulate low-level features. We further design DiceFocalCELoss to mitigate voxel-level class imbalance. The model adopts a lightweight preprocessing pipeline inspired by nnU-Net/dynUNet. On multiple 3D medical segmentation benchmarks, nnY-Net consistently outperforms SwinUNETR and MedNeXt, achieving an average Dice score improvement of 1.8–3.2%, enhanced robustness for small-structure segmentation, and 22% faster training convergence.
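The cross-modal cross-attention described above, with patient-level clinical embeddings as queries and bottleneck voxel features as keys/values, can be sketched in a minimal, framework-free form. This is not the paper's implementation: the function names, tensor shapes, and projection matrices (`Wq`, `Wk`, `Wv`) are assumptions for illustration only.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def clinical_cross_attention(feat, clin_query, Wq, Wk, Wv):
    """Hypothetical sketch of the bottleneck cross-attention.

    feat:       (N, C)  flattened lowest-level encoder feature map (N voxels)
    clin_query: (1, Dc) patient-level clinical embedding (pathology, treatment)
    Wq/Wk/Wv:   learned projections (shapes are assumptions, not the paper's)
    """
    q = clin_query @ Wq                # (1, d): clinical semantics as Query
    k = feat @ Wk                      # (N, d): image features as Key
    v = feat @ Wv                      # (N, d): image features as Value
    # Scaled dot-product attention: which voxels matter for this patient?
    w = softmax(q @ k.T / np.sqrt(q.shape[-1]))   # (1, N) weights over voxels
    return w @ v                       # (1, d) clinically conditioned summary
```

In the full model this summary would be fused back into the decoder path; here it only demonstrates how a single clinical query attends over all bottleneck voxels.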
Abstract
This paper presents nnY-Net, a novel 3D medical image segmentation architecture. The name reflects the cross-attention module added at the bottom of the U-Net structure, which turns the U shape into a Y. We combine the strengths of two recent SOTA models, MedNeXt and SwinUNETR, using a Swin Transformer encoder and a ConvNeXt decoder in an innovative design we call Swin-NeXt. In the cross-attention module, the encoder's lowest-level feature map supplies the Keys and Values, while patient features such as pathology and treatment information serve as the Query for computing the attention weights. We also simplify the pre- and post-processing and data augmentation of 3D image segmentation based on the dynUNet and nnU-Net frameworks, and integrate the proposed Swin-NeXt with cross-attention into this pipeline. Finally, we construct a DiceFocalCELoss to improve training efficiency under the class-imbalanced convergence of voxel classification.
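The DiceFocalCELoss named above combines three standard terms: soft Dice (region overlap), focal loss (down-weighting easy voxels), and cross-entropy. The paper does not give its formulation, so the following is a hedged sketch; the term weights, `gamma`, and the simple unweighted sum are assumptions.

```python
import numpy as np

def dice_focal_ce_loss(probs, target, gamma=2.0, eps=1e-6,
                       w_dice=1.0, w_focal=1.0, w_ce=1.0):
    """Sketch of a combined Dice + Focal + CE loss for voxel classification.

    probs:  (N, K) softmax class probabilities for N voxels, K classes
    target: (N,)   integer class id per voxel
    gamma, w_*: hypothetical hyperparameters, not the paper's values
    """
    N, K = probs.shape
    onehot = np.eye(K)[target]                      # (N, K) one-hot targets
    pt = probs[np.arange(N), target]                # probability of true class
    ce = -np.log(pt + eps)                          # cross-entropy per voxel
    focal = ((1.0 - pt) ** gamma) * ce              # focal: suppress easy voxels
    inter = (probs * onehot).sum(axis=0)            # per-class soft intersection
    dice = 1.0 - (2.0 * inter + eps) / (probs.sum(axis=0) + onehot.sum(axis=0) + eps)
    return w_dice * dice.mean() + w_focal * focal.mean() + w_ce * ce.mean()
```

The focal term is what targets the uneven convergence of voxel classes: confidently classified (typically background) voxels contribute almost nothing, so gradients concentrate on rare, hard structures.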