🤖 AI Summary
Existing multiple instance learning (MIL) methods for whole-slide image (WSI) classification have three key limitations: attention-based methods neglect local histological context; Transformers incur quadratic computational cost and are prone to overfitting; and state-space models (SSMs) lose histological meaning and interpretability when the patch order is shuffled. To address these, we propose SemaMIL, a MIL framework with two core components: (i) a clustering-driven *reversible semantic reordering* that places semantically similar patches next to each other in the sequence, and (ii) *retrieval-guided state-space modeling*, which retrieves a representative subset of queries to adjust the state-space parameters and improve global modeling. This design keeps the linear-time complexity of SSMs, and the reordering is intended to keep the sequence histologically meaningful and therefore interpretable. Evaluated on four WSI subtyping benchmarks, SemaMIL attains state-of-the-art accuracy with fewer FLOPs and parameters than strong baselines.
📝 Abstract
Multiple instance learning (MIL) has become the leading approach for extracting discriminative features from whole slide images (WSIs) in computational pathology. Attention-based MIL methods can identify key patches but tend to overlook contextual relationships. Transformer models can model interactions between patches but incur quadratic computational cost and are prone to overfitting. State space models (SSMs) offer linear complexity, yet shuffling patch order disrupts histological meaning and reduces interpretability. In this work, we introduce SemaMIL, which integrates two components. The first is Semantic Reordering (SR), an adaptive method that clusters semantically similar patches and arranges them in sequence through a reversible permutation. The second is a Semantic-guided Retrieval State Space Module (SRSM), which chooses a representative subset of queries to adjust state space parameters for improved global modeling. Evaluation on four WSI subtyping datasets shows that, compared to strong baselines, SemaMIL achieves state-of-the-art accuracy with fewer FLOPs and parameters.
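
To make the idea of a reversible permutation concrete, the following is a minimal sketch and not the authors' implementation. It assumes each patch already has a cluster label from some clustering step (the abstract does not specify the algorithm), and the function name `semantic_reorder` and the toy data are hypothetical. Patches are grouped by label with a stable sort, and the inverse permutation restores the original patch order.

```python
import numpy as np


def semantic_reorder(features, labels):
    """Group patches by cluster label; return reordered features,
    the permutation used, and its inverse (all names are illustrative)."""
    # Stable sort keeps the original relative order inside each cluster.
    perm = np.argsort(labels, kind="stable")
    # Build the inverse so that reordered[inverse] recovers the input order.
    inverse = np.empty_like(perm)
    inverse[perm] = np.arange(perm.size)
    return features[perm], perm, inverse


# Toy bag: 6 patches with 4-dimensional features and assumed cluster labels.
rng = np.random.default_rng(0)
features = rng.normal(size=(6, 4))
labels = np.array([2, 0, 1, 0, 2, 1])

reordered, perm, inverse = semantic_reorder(features, labels)
restored = reordered[inverse]  # undo the reordering
```

Because the permutation is invertible, per-patch outputs computed on the reordered sequence can be mapped back to their original slide positions, which is what makes the reordering compatible with patch-level visualization.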