SwInception -- Local Attention Meets Convolutions

📅 2026-05-28
🤖 AI Summary
This work addresses overfitting and limited segmentation performance in sparse vision transformers, such as Swin, when they are trained on small medical imaging datasets. The authors propose an architecture that places Inception-style multi-branch convolutional blocks in the feed-forward layers of the Swin transformer block, so that local, multi-scale features are modeled directly. The decoder layers are also modified to capture finer details with fewer parameters. Combining local attention with multi-scale convolutions strengthens the model's inductive bias, which is intended to improve generalization and segmentation accuracy when data is scarce. The authors report performance improvements on eleven medical datasets, including gains over previous state-of-the-art backbones on the Medical Segmentation Decathlon and Beyond the Cranial Vault benchmarks.
📝 Abstract
Sparse vision transformers have gained popularity as efficient encoders for medical volumetric segmentation, with Swin emerging as a prominent choice. Swin uses local attention to reduce complexity and yields excellent performance for many tasks but still tends to overfit on small datasets. To mitigate this weakness, we propose a novel architecture that further enhances Swin's inductive bias by introducing Inception blocks in the feed-forward layers. The introduction of these multi-branch convolutions enables more direct reasoning over local, multi-scale features within the transformer block. We have also modified the decoder layers in order to capture finer details using fewer parameters. We demonstrate a performance improvement on eleven different medical datasets through extensive experimentation. We specifically showcase advancements over the previous state-of-the-art backbones on benchmark challenges like the Medical Segmentation Decathlon and Beyond the Cranial Vault. By showing that the existing inductive bias in Swin can be further improved, our work presents a promising avenue for enhancing the capabilities of sparse vision transformers for both medical and natural image segmentation tasks. Code and pre-trained weights can be accessed at https://github.com/Eiphodos/SwInception.
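The abstract's statement that local attention reduces complexity can be made concrete with a rough cost estimate. The sketch below uses the standard multiply-accumulate approximations for global and windowed self-attention from the original Swin Transformer analysis, extended to a flattened 3D token grid. The function names, the feature-map size, the channel width, and the window size are illustrative assumptions, not values taken from this paper.

```python
def global_attention_cost(num_tokens: int, channels: int) -> int:
    """Approximate multiply-accumulates for one global self-attention layer."""
    projections = 4 * num_tokens * channels ** 2   # Q, K, V and output projections
    attention = 2 * num_tokens ** 2 * channels     # QK^T plus the weighted sum over V
    return projections + attention


def window_attention_cost(num_tokens: int, channels: int, window_tokens: int) -> int:
    """Same estimate when each token attends only within its local window."""
    projections = 4 * num_tokens * channels ** 2
    attention = 2 * window_tokens * num_tokens * channels
    return projections + attention


# Hypothetical volumetric feature map: 56 x 56 x 56 tokens, 96 channels,
# partitioned into 7 x 7 x 7 windows (all values are assumptions).
tokens = 56 ** 3
channels = 96
window = 7 ** 3

cost_global = global_attention_cost(tokens, channels)
cost_window = window_attention_cost(tokens, channels, window)
print(f"global:   {cost_global:.3e}")
print(f"windowed: {cost_window:.3e}")
print(f"ratio:    {cost_global / cost_window:.1f}x")
```

The attention term grows quadratically with the number of tokens in the global case but only linearly in the windowed case, which is why windowing matters most for large volumetric inputs.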
Problem

Research questions and friction points this paper is trying to address.

overfitting
medical image segmentation
sparse vision transformers
small datasets
inductive bias
Innovation

Methods, ideas, or system contributions that make the work stand out.

SwInception
local attention
Inception blocks
sparse vision transformers
medical image segmentation
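
To illustrate the multi-branch idea behind the Inception blocks tag, here is a toy, one-dimensional sketch in plain Python. It is not the authors' implementation: the paper works on volumetric data with learned convolutions inside the transformer's feed-forward layers, whereas this example uses fixed averaging kernels on a short list. The function names and kernel sizes are assumptions chosen for illustration.

```python
def conv1d_same(signal, kernel):
    """Cross-correlate a 1D signal with an odd-length kernel, zero-padded to keep the length."""
    half = len(kernel) // 2
    padded = [0.0] * half + list(signal) + [0.0] * half
    return [
        sum(k * padded[i + j] for j, k in enumerate(kernel))
        for i in range(len(signal))
    ]


def inception_style_branches(signal, kernel_sizes=(1, 3, 5)):
    """Run parallel branches with different receptive fields and return one output per branch.

    Each branch uses a fixed averaging kernel (an assumption for this toy);
    in a trained network the kernel weights would be learned, and the branch
    outputs would typically be concatenated along the channel axis.
    """
    return [conv1d_same(signal, [1.0 / k] * k) for k in kernel_sizes]


# A single spike shows how each branch spreads information over a different neighborhood.
features = inception_style_branches([0.0, 0.0, 1.0, 0.0, 0.0])
for size, branch in zip((1, 3, 5), features):
    print(size, [round(v, 3) for v in branch])
```

The branch with kernel size 1 leaves the spike untouched, while the larger kernels spread it over progressively wider neighborhoods. This is the sense in which parallel branches give a block access to features at several local scales at once.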