🤖 AI Summary
This work addresses a limitation of existing whole-slide image multiple instance learning (WSI-MIL) models: even when they achieve high AUC, their attention scores often reflect aggregation preference rather than a compact, model-sufficient rationale for the prediction. The authors propose FOCI (Finding Optimal Contextual Instances), a lightweight rationale-readout layer trained over a frozen MIL backbone with model-output sufficiency and exclusion objectives to select compact, output-consistent tile subsets. They evaluate it with an insertion-style Sequential Reveal Protocol (SRP) adapted to WSI-MIL and summarize results with the Selection Headroom Index (SHI). Across three WSI benchmarks and seven MIL backbones, FOCI reduces TransMIL's Minimum Sufficient K (MSK) tile count by 32–56% relative to its CLS-proxy ranking, while ACMIL+FOCI attains the highest mean SHI (+0.465). The results position FOCI as a model-level interpretability and audit layer, not as a claim of clinical diagnostic sufficiency.
📝 Abstract
Whole-slide image (WSI) multiple instance learning (MIL) classifiers can achieve strong slide-level AUC while leaving the full-bag prediction opaque. Attention scores are widely reused as post-hoc explanations, but high attention can reflect aggregation preference rather than a compact, model-sufficient rationale. We study post-hoc rationale highlighting for frozen WSI-MIL: given a trained classifier, can its slide-level prediction be recovered from a compact, output-consistent tile subset without retraining the backbone? We instantiate this with Finding Optimal Contextual Instances (FOCI), a lightweight rationale-readout layer over a frozen MIL backbone. FOCI is trained with model-output sufficiency and exclusion objectives over keep/drop tile subsets, evaluated with an insertion-style Sequential Reveal Protocol (SRP) adapted to WSI-MIL, and summarized by the Selection Headroom Index (SHI). Across three WSI benchmarks and seven MIL backbones, FOCI reveals that compact rationales are selection-headroom dependent: transformer and multi-branch attention aggregators can admit compact rationales, near-minimal attention-pooling baselines enter a selection-saturation regime, and hard-selection backbones can conflict with an external readout. For TransMIL, relative to its documented CLS-proxy ranking, FOCI reduces the Minimum Sufficient K (MSK) tile count by 32-56% across benchmarks, while ACMIL+FOCI attains the highest mean SHI (+0.465). Deletion-based perturbation and selected-only downstream evaluation provide complementary checks. These results position FOCI as a model-level interpretability and audit layer: selected tiles are not claims of clinical or pathologist-level diagnostic sufficiency, but candidate rationales that offer a compact, reviewable view of when a frozen MIL prediction can be localized to a small output-consistent subset.
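As an illustration of the insertion-style evaluation idea, the following is a minimal sketch and not the authors' implementation. It assumes a toy frozen classifier (`toy_frozen_mil`, hypothetical), a 0.5 decision threshold, and a simplified reading of the Minimum Sufficient K as the smallest number of top-ranked tiles whose prediction matches the full-bag label; the abstract does not specify these details, and SHI is not computed here because its definition is not given.

```python
import math
from typing import Callable, List, Sequence


def toy_frozen_mil(bag: Sequence[float]) -> float:
    """Hypothetical frozen MIL classifier: sigmoid of the mean tile logit."""
    if not bag:
        return 0.5
    mean_logit = sum(bag) / len(bag)
    return 1.0 / (1.0 + math.exp(-mean_logit))


def sequential_reveal(
    model: Callable[[Sequence[float]], float],
    bag: Sequence[float],
    ranking: Sequence[int],
) -> List[float]:
    """Model outputs after revealing the top-k ranked tiles, for k = 1..N."""
    outputs: List[float] = []
    revealed: List[float] = []
    for idx in ranking:
        revealed.append(bag[idx])
        outputs.append(model(revealed))
    return outputs


def minimum_sufficient_k(
    model: Callable[[Sequence[float]], float],
    bag: Sequence[float],
    ranking: Sequence[int],
    threshold: float = 0.5,  # assumed decision threshold
) -> int:
    """Smallest k whose revealed subset reproduces the full-bag label (assumed criterion)."""
    full_label = model(bag) >= threshold
    for k, out in enumerate(sequential_reveal(model, bag, ranking), start=1):
        if (out >= threshold) == full_label:
            return k
    return len(bag)


# Toy bag of per-tile logits; two tiles carry most of the positive evidence.
bag = [-1.0, -0.5, 3.0, -1.0, 2.5, -0.8]
good_ranking = sorted(range(len(bag)), key=lambda i: bag[i], reverse=True)
poor_ranking = sorted(range(len(bag)), key=lambda i: bag[i])

msk_good = minimum_sufficient_k(toy_frozen_mil, bag, good_ranking)
msk_poor = minimum_sufficient_k(toy_frozen_mil, bag, poor_ranking)
```

In this toy setting the same frozen model needs far fewer revealed tiles under the informative ranking than under the poor one, which is the kind of gap between tile rankings that an MSK comparison is meant to expose.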