Co-distilled attention guided masked image modeling with noisy teacher for self-supervised learning on medical images

📅 2026-04-15
📈 Citations: 0
Influential: 0

🤖 AI Summary
This work addresses two challenges in self-supervised learning for medical images. First, random masking leaks information because neighboring image patches are semantically similar, which makes the pretraining task too easy. Second, Swin Transformers lack a global [CLS] token, so they cannot use advanced masking strategies. The authors propose DAGMaN, a co-distillation framework with an attention-guided masking mechanism for Swin that selectively masks semantically co-occurring and discriminative patches. Because attention-guided masking reduces attention head diversity, DAGMaN adds a noisy teacher to the co-distillation framework to keep that diversity high. The authors demonstrate DAGMaN on lung nodule classification (full-data and few-shot settings), immunotherapy outcome prediction, tumor segmentation, and unsupervised organ clustering.
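The contrast between random masking and attention-guided masking can be illustrated with a minimal sketch. It assumes a per-patch attention score is already available as an array; how DAGMaN obtains such scores in Swin without a [CLS] token is not described on this page, so `patch_scores`, `mask_ratio`, and both function names are hypothetical and not the authors' implementation.

```python
import numpy as np


def attention_guided_mask(patch_scores, mask_ratio=0.6):
    """Mask the highest-scoring patches (True = masked).

    Assumption: patch_scores holds one attention score per patch.
    This is an illustrative selection rule, not DAGMaN's exact one.
    """
    scores = np.asarray(patch_scores, dtype=float)
    num_patches = scores.size
    num_masked = int(round(mask_ratio * num_patches))
    # Sort patch indices from highest to lowest score.
    order = np.argsort(-scores.ravel(), kind="stable")
    mask = np.zeros(num_patches, dtype=bool)
    mask[order[:num_masked]] = True
    return mask.reshape(scores.shape)


def random_mask(shape, mask_ratio=0.6, seed=0):
    """Baseline: mask a uniformly random subset of patches."""
    rng = np.random.default_rng(seed)
    num_patches = int(np.prod(shape))
    num_masked = int(round(mask_ratio * num_patches))
    mask = np.zeros(num_patches, dtype=bool)
    chosen = rng.choice(num_patches, size=num_masked, replace=False)
    mask[chosen] = True
    return mask.reshape(shape)


# Toy 4x4 patch grid whose central 2x2 block has the highest scores.
scores = np.array([
    [0.1, 0.2, 0.1, 0.0],
    [0.2, 0.9, 0.8, 0.1],
    [0.1, 0.7, 0.6, 0.1],
    [0.0, 0.1, 0.1, 0.0],
])
guided = attention_guided_mask(scores, mask_ratio=0.25)
baseline = random_mask(scores.shape, mask_ratio=0.25)
```

With the toy scores, the guided mask covers the whole high-attention block, so none of those patches remains visible to hint at its neighbors, whereas the random baseline may leave some of them unmasked.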

📝 Abstract
Masked image modeling (MIM) is a highly effective self-supervised learning (SSL) approach for extracting useful feature representations from unannotated data. The predominantly used random masking methods make SSL less effective for medical images because neighboring patches are contextually similar, which leads to information leakage and simplifies the SSL task. The hierarchical shifted window (Swin) transformer, a highly effective architecture for medical images, cannot use advanced masking methods because it lacks a global [CLS] token. Hence, we introduce an attention-guided masking mechanism for Swin within a co-distillation learning framework that selectively masks semantically co-occurring and discriminative patches, reducing information leakage and increasing the difficulty of SSL pretraining. However, attention-guided masking inevitably reduces the diversity of attention heads, which negatively impacts downstream task performance. To address this, we integrate, for the first time, a noisy teacher into the co-distillation framework (termed DAGMaN), which performs attentive masking while preserving high attention head diversity. We demonstrate the capability of DAGMaN on multiple tasks, including full- and few-shot lung nodule classification, immunotherapy outcome prediction, tumor segmentation, and unsupervised organ clustering.
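Attention head diversity can be quantified in several ways, and the abstract does not say which measure the authors use. The sketch below assumes one simple choice: one minus the mean pairwise cosine similarity between per-head attention maps. The function name `head_diversity` and the `(heads, patches)` input layout are hypothetical.

```python
import numpy as np


def head_diversity(head_maps):
    """Return 1 - mean pairwise cosine similarity across heads.

    Assumption: head_maps has shape (num_heads, num_patches), with
    num_heads >= 2. This is an illustrative metric, not necessarily
    the one used in the paper.
    """
    maps = np.asarray(head_maps, dtype=float)
    norms = np.linalg.norm(maps, axis=1, keepdims=True)
    unit = maps / np.clip(norms, 1e-12, None)  # guard against zero rows
    similarity = unit @ unit.T                 # (heads, heads) cosine matrix
    num_heads = maps.shape[0]
    off_diagonal = similarity[~np.eye(num_heads, dtype=bool)]
    return 1.0 - float(off_diagonal.mean())


# Heads that all attend to the same patches have no diversity.
collapsed = np.tile([0.5, 0.5, 0.0, 0.0], (3, 1))
# Heads that attend to disjoint patches are maximally diverse.
disjoint = np.eye(3)

collapsed_score = head_diversity(collapsed)
disjoint_score = head_diversity(disjoint)
```

Under this measure, a model whose heads collapse onto the same patches scores near 0, while heads that attend to disjoint patches score 1, which matches the intuition behind the diversity loss the abstract describes.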
Problem

Research questions and friction points this paper is trying to address.

masked image modeling
self-supervised learning
medical images
attention diversity
information leakage
Innovation

Methods, ideas, or system contributions that make the work stand out.

attention-guided masking
co-distillation
noisy teacher
masked image modeling
Swin Transformer