🤖 AI Summary
This paper addresses the long-tailed class distribution problem in whole-slide image (WSI) analysis for computational pathology. The authors propose a multi-expert ensemble framework built on multi-instance learning (MIL), featuring a shared aggregator and multiple specialized decoders that jointly model distributional diversity. To enhance semantic-aware representation learning for minority classes, they introduce a learnable text-prompt-guided multimodal knowledge distillation mechanism that integrates a pre-trained pathology text encoder, dynamic prompt tuning, and consistency regularization to strengthen discriminative capability on sparse categories. Evaluated on two long-tailed WSI benchmarks, Camelyon+-LT and PANDA-LT, the method achieves an absolute improvement of over 8.2% in minority-class classification accuracy over state-of-the-art methods, with markedly better generalization and robustness.
📝 Abstract
Multiple Instance Learning (MIL) plays a significant role in computational pathology, enabling weakly supervised analysis of Whole Slide Image (WSI) datasets. WSI analysis, however, faces a severe long-tailed distribution problem: class imbalance leaves some classes with sparse samples while others are abundant, making it difficult for classifiers to accurately identify minority-class samples. To address this issue, we propose an ensemble learning method based on MIL that employs expert decoders with a shared aggregator and consistency constraints to learn diverse distributions and reduce the impact of class imbalance on classifier performance. Moreover, we introduce a multimodal distillation framework that leverages text encoders pre-trained on pathology-text pairs to distill knowledge and guide the MIL aggregator toward stronger semantic features relevant to class information. To ensure flexibility, we use learnable prompts to guide the distillation process of the pre-trained text encoder, avoiding the limitations imposed by fixed, hand-crafted prompts. Our method, MDE-MIL, thus integrates multiple expert branches, each focusing on a specific data distribution, to address long-tailed issues; consistency control ensures generalization across classes, and multimodal distillation enhances feature extraction. Experiments on the Camelyon+-LT and PANDA-LT datasets show that it outperforms state-of-the-art methods.
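To make the shared-aggregator / multi-expert idea concrete, here is a minimal NumPy sketch of the forward pass: an attention-based MIL aggregator pools instance features into one slide embedding, several expert decoders each produce class logits from it, and a consistency term penalizes disagreement between experts. All weight shapes, the random initialization, and the KL-based consistency loss are illustrative assumptions for exposition; they are not the paper's exact architecture or loss.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Hypothetical dimensions: feature dim D, attention dim H, K experts, C classes.
D, H, K, C = 16, 8, 3, 4
W_att = rng.normal(size=(D, H))
v_att = rng.normal(size=(H,))
W_experts = rng.normal(size=(K, D, C))  # one linear decoder per expert

def aggregate(bag):
    """Shared attention-based MIL pooling: bag of instances -> slide embedding."""
    scores = np.tanh(bag @ W_att) @ v_att   # (N,) unnormalized attention scores
    attn = softmax(scores)                  # attention weights over instances
    return attn @ bag                       # (D,) weighted-average embedding

def expert_logits(z):
    """Each expert decoder maps the shared embedding to its own class logits."""
    return np.stack([z @ W_experts[k] for k in range(K)])  # (K, C)

def consistency_loss(probs):
    """Mean KL divergence of each expert's prediction from the ensemble mean."""
    mean = probs.mean(axis=0, keepdims=True)
    return float(np.mean(np.sum(probs * np.log(probs / mean + 1e-12), axis=-1)))

bag = rng.normal(size=(32, D))       # 32 instance features from one WSI
z = aggregate(bag)                   # shared slide-level embedding
probs = softmax(expert_logits(z))    # (K, C) per-expert class probabilities
ensemble = probs.mean(axis=0)        # final prediction: average over experts
cons = consistency_loss(probs)       # regularizer keeping experts consistent
```

In training, each expert branch would see a differently re-balanced view of the long-tailed label distribution, while the consistency term and the shared aggregator keep the ensemble coherent; the text-distillation loss (not shown) would additionally align `z` with prompt-conditioned pathology text embeddings.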