🤖 AI Summary
This work addresses the high computational cost of end-to-end fine-tuning of foundation models for mammography analysis, which arises from the high resolution of mammographic images, scarce annotations, and predominantly breast-level labels. To overcome this, the authors propose MIL-PF, a framework that freezes a pretrained vision encoder, precomputes patch-level features, and introduces a lightweight attention-based multiple instance learning (MIL) aggregation module containing only 40k parameters. This design enables joint modeling of global tissue context and sparse local lesion signals without retraining large models. Evaluated on datasets at clinical scale, MIL-PF achieves state-of-the-art performance in breast cancer classification while substantially reducing training resource requirements. The code is publicly released to ensure reproducibility.
📝 Abstract
Modern foundation models provide highly expressive visual representations, yet adapting them to high-resolution medical imaging remains challenging due to limited annotations and weak supervision. Mammography, in particular, is characterized by large images, variable multi-view studies, and predominantly breast-level labels, making end-to-end fine-tuning computationally expensive and often impractical. We propose Multiple Instance Learning on Precomputed Features (MIL-PF), a scalable framework that combines frozen foundation encoders with a lightweight MIL head for mammography classification. By precomputing semantic representations and training only a small task-specific aggregation module (40k parameters), the method enables efficient experimentation and adaptation without retraining large backbones. The architecture explicitly models global tissue context and sparse local lesion signals through attention-based aggregation. MIL-PF achieves state-of-the-art classification performance at clinical scale while substantially reducing training complexity. We release the code for full reproducibility.
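The abstract's core recipe, attention-based MIL pooling over precomputed patch features from a frozen encoder, can be sketched in a few lines. This is a generic forward pass in the style of standard attention-based MIL (not the authors' released code); the sizes `N`, `D`, and `H` and the random parameters are illustrative assumptions:

```python
import numpy as np

def softmax(x, axis=0):
    # Numerically stable softmax over the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)

# Hypothetical sizes: N patch features of dimension D, precomputed once by the
# frozen encoder; H is the attention head's hidden width (all assumed values).
N, D, H = 16, 768, 128
feats = rng.standard_normal((N, D))        # stand-in for precomputed patch features

# The only trainable part: a small attention-MIL aggregation head.
V = rng.standard_normal((D, H)) * 0.02     # attention projection
w = rng.standard_normal((H, 1)) * 0.02     # attention scoring vector
W_cls = rng.standard_normal((D,)) * 0.02   # breast-level linear classifier

scores = np.tanh(feats @ V) @ w            # (N, 1) unnormalized patch attention
alpha = softmax(scores, axis=0)            # attention weights over patches, sum to 1
bag = (alpha * feats).sum(axis=0)          # (D,) attention-pooled breast embedding
logit = float(bag @ W_cls)
prob = 1.0 / (1.0 + np.exp(-logit))        # breast-level malignancy probability
```

The attention weights `alpha` let the head emphasize a few suspicious patches (sparse local lesion signal) while the weighted sum still aggregates every patch (global tissue context), and since `feats` is precomputed, only the small head is touched during training.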