🤖 AI Summary
This work addresses overfitting and limited segmentation performance in sparse vision Transformers, such as the Swin Transformer, when they are trained on small medical imaging datasets. The authors propose an architecture that places Inception-style multi-branch convolutional modules in the feed-forward layers of the Swin Transformer block, which strengthens local, multi-scale feature modeling. They also modify the decoder so that it recovers finer details with fewer parameters. Combining local attention with multi-scale convolutions strengthens the model's inductive bias, which is intended to improve generalization and segmentation accuracy when training data is scarce. The method improves performance on eleven medical image segmentation datasets and outperforms previous state-of-the-art backbones on benchmarks such as the Medical Segmentation Decathlon and Beyond the Cranial Vault.
📝 Abstract
Sparse vision transformers have gained popularity as efficient encoders for medical volumetric segmentation, with Swin emerging as a prominent choice. Swin uses local attention to reduce complexity and performs well on many tasks, but it still tends to overfit on small datasets. To mitigate this weakness, we propose a novel architecture that further enhances Swin's inductive bias by introducing Inception blocks in the feed-forward layers. These multi-branch convolutions enable more direct reasoning over local, multi-scale features within the transformer block. We also modify the decoder layers to capture finer details using fewer parameters. Through extensive experiments, we demonstrate a performance improvement on eleven different medical datasets. In particular, we show improvements over the previous state-of-the-art backbones on benchmark challenges such as the Medical Segmentation Decathlon and Beyond the Cranial Vault. By showing that the existing inductive bias in Swin can be further improved, our work presents a promising avenue for enhancing the capabilities of sparse vision transformers for both medical and natural image segmentation tasks. Code and pre-trained weights can be accessed at https://github.com/Eiphodos/SwInception.
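To make the central idea concrete, the following is a minimal PyTorch sketch of a feed-forward layer that contains parallel convolutions at several scales. It is an illustration under stated assumptions, not the authors' implementation: the class name `InceptionFeedForward`, the depthwise 3D branches, the kernel sizes (1, 3, 5), and the channel-split-and-concatenate layout are all hypothetical choices that the abstract does not specify. The released code at the repository above is the reference for the actual design.

```python
import torch
from torch import nn


class InceptionFeedForward(nn.Module):
    """Hypothetical feed-forward layer with multi-branch 3D convolutions.

    Assumption: the hidden channels are split evenly across branches, each
    branch applies a depthwise convolution with its own kernel size, and the
    branch outputs are concatenated back together.
    """

    def __init__(self, dim: int, hidden_dim: int, kernel_sizes=(1, 3, 5)):
        super().__init__()
        if hidden_dim % len(kernel_sizes) != 0:
            raise ValueError("hidden_dim must be divisible by the number of branches")
        branch_dim = hidden_dim // len(kernel_sizes)
        # Pointwise expansion, as in a standard transformer feed-forward layer.
        self.expand = nn.Linear(dim, hidden_dim)
        # One depthwise convolution per scale; padding k // 2 keeps the
        # spatial size unchanged for odd kernel sizes.
        self.branches = nn.ModuleList(
            [
                nn.Conv3d(branch_dim, branch_dim, k, padding=k // 2, groups=branch_dim)
                for k in kernel_sizes
            ]
        )
        self.act = nn.GELU()
        # Pointwise projection back to the token dimension.
        self.project = nn.Linear(hidden_dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x has shape (batch, depth, height, width, channels).
        x = self.expand(x)
        x = x.permute(0, 4, 1, 2, 3)  # to channels-first for Conv3d
        chunks = torch.chunk(x, len(self.branches), dim=1)
        x = torch.cat([conv(c) for conv, c in zip(self.branches, chunks)], dim=1)
        x = x.permute(0, 2, 3, 4, 1)  # back to channels-last
        return self.project(self.act(x))


if __name__ == "__main__":
    tokens = torch.randn(2, 4, 8, 8, 24)  # a small volumetric feature map
    ffn = InceptionFeedForward(dim=24, hidden_dim=96)
    print(ffn(tokens).shape)
```

In this sketch the layer preserves the input shape, so it could stand in for the usual two-layer MLP of a transformer block. The convolutions mix information between neighbouring voxels at several receptive-field sizes, which is the kind of local, multi-scale inductive bias the abstract describes.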