🤖 AI Summary
Standard Vision Transformers struggle to model hierarchical anatomical community structures—such as organs, tissues, and lesions—in medical images. To address this, we propose DCMM-Transformer, the first framework to embed a differentiable Degree-Corrected Mixed-Membership (DCMM) model into the self-attention mechanism as a structure-aware additive bias, replacing non-differentiable binary masks. This enables end-to-end modeling of complex anatomical communities while enhancing interpretability. The design ensures training stability, yields anatomically consistent attention maps, and supports cross-modal generalization. Evaluated on multi-center datasets spanning brain, chest, breast, and ocular imaging, DCMM-Transformer significantly outperforms state-of-the-art methods in both accuracy and clinical interpretability, bridging the gap between high performance and anatomical plausibility.
📝 Abstract
Medical images exhibit latent anatomical groupings, such as organs, tissues, and pathological regions, that standard Vision Transformers (ViTs) fail to exploit. While recent work such as SBM-Transformer attempts to incorporate such structures through stochastic binary masking, it suffers from non-differentiability, training instability, and an inability to model complex community structure. We present DCMM-Transformer, a novel ViT architecture for medical image analysis that incorporates a Degree-Corrected Mixed-Membership (DCMM) model as an additive bias in self-attention. Unlike prior approaches that rely on multiplicative masking and binary sampling, our method introduces community structure and degree heterogeneity in a fully differentiable and interpretable manner. Comprehensive experiments across diverse medical imaging datasets, including brain, chest, breast, and ocular modalities, demonstrate the superior performance and generalizability of the proposed approach. Furthermore, the learned group structure and structured attention modulation substantially enhance interpretability by yielding attention maps that are anatomically meaningful and semantically coherent.
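To make the core idea concrete, the following is a minimal NumPy sketch of self-attention with a DCMM-style additive bias. It is an illustration under our own assumptions, not the paper's implementation: we assume per-token mixed-membership vectors `Pi`, a community affinity matrix `B`, and degree-correction factors `theta`, and we add the resulting DCMM affinity term `theta_i * theta_j * (Pi_i @ B @ Pi_j)` directly to the attention logits before the softmax. The actual parameterization (e.g., a log-domain bias, per-head parameters) may differ in the published method.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def dcmm_attention(Q, K, V, Pi, B, theta):
    """Self-attention with an additive DCMM bias (illustrative sketch).

    Q, K, V : (n, d) token queries / keys / values
    Pi      : (n, k) row-stochastic mixed-membership vectors
    B       : (k, k) community affinity matrix
    theta   : (n,)   per-token degree-correction factors
    """
    d = Q.shape[-1]
    logits = Q @ K.T / np.sqrt(d)                    # standard scaled dot-product
    bias = np.outer(theta, theta) * (Pi @ B @ Pi.T)  # DCMM expected-affinity term
    attn = softmax(logits + bias, axis=-1)           # additive bias: stays differentiable
    return attn @ V, attn
```

Because the bias enters additively inside a smooth softmax rather than as a sampled binary mask multiplying the attention matrix, gradients flow to `Pi`, `B`, and `theta` end to end, which is the property the abstract contrasts against SBM-style multiplicative masking.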