🤖 AI Summary
Digital pathology whole-slide image (WSI) analysis faces challenges including extreme image scale, sparse annotations, catastrophic forgetting during fine-tuning, and under-use of shared information across tasks and modalities. To address these, we propose ModalTune, a fine-tuning framework for slide-level foundation models (SLFMs). ModalTune introduces a novel Modal Adapter that fuses new modalities without updating any SLFM weights, and leverages large language models (LLMs) to encode label text semantically, unifying multi-cancer, multi-task, and pan-cancer prediction in a single training recipe. Through joint multi-task fine-tuning and cross-cancer generalization training, ModalTune achieves state-of-the-art performance across four cancer types, significantly improving survival prediction and histological subtype classification accuracy. Moreover, it demonstrates strong generalization on pan-cancer benchmarks and two out-of-distribution (OOD) datasets. Key contributions include: (1) a Modal Adapter that integrates new modalities with zero SLFM weight updates; (2) LLM-driven semantic label encoding for unified cancer modeling; and (3) a robust, scalable fine-tuning framework with superior cross-cancer and OOD generalization.
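The core Modal Adapter idea, fusing a new modality with a precomputed slide embedding while the SLFM itself stays frozen, can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the dimensions, the additive projection-based fusion, and the `ModalAdapter` class are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions (not from the paper).
D_SLIDE, D_OMICS, D_FUSED = 768, 128, 256

# Frozen SLFM output: a precomputed slide-level embedding.
# No gradients ever flow into the SLFM that produced it.
slide_emb = rng.standard_normal(D_SLIDE)

# Embedding of a new modality, e.g. molecular features.
omics_emb = rng.standard_normal(D_OMICS)


class ModalAdapter:
    """Small trainable fusion module; only these weights are updated."""

    def __init__(self):
        self.W_s = rng.standard_normal((D_FUSED, D_SLIDE)) * 0.02
        self.W_o = rng.standard_normal((D_FUSED, D_OMICS)) * 0.02
        self.b = np.zeros(D_FUSED)

    def __call__(self, s: np.ndarray, o: np.ndarray) -> np.ndarray:
        # Project each modality into a shared space, then combine.
        h = self.W_s @ s + self.W_o @ o + self.b
        return np.tanh(h)


adapter = ModalAdapter()
fused = adapter(slide_emb, omics_emb)
print(fused.shape)  # (256,)
```

During fine-tuning only `W_s`, `W_o`, and `b` would receive gradient updates, which is what makes the fusion "zero-weight-update" with respect to the SLFM.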
📝 Abstract
Prediction tasks in digital pathology are challenging due to the massive size of whole-slide images (WSIs) and the weak nature of training signals. Advances in computing, data availability, and self-supervised learning (SSL) have paved the way for slide-level foundation models (SLFMs) that can improve prediction tasks in low-data regimes. However, working with these models is challenging, with issues such as catastrophic forgetting during fine-tuning and under-utilization of shared information between tasks and modalities. To overcome these two challenges, we propose ModalTune, a novel fine-tuning framework that introduces the Modal Adapter to integrate new modalities without modifying SLFM weights. Additionally, we use large language models (LLMs) to encode labels as text, capturing semantic relationships and enhancing generalization across multiple tasks and cancer types in a single training recipe. ModalTune achieves state-of-the-art (SOTA) results against both uni-modal and multi-modal models across four cancer types, jointly improving survival and cancer subtype prediction while remaining competitive in pan-cancer settings. Additionally, we show ModalTune is highly generalizable to two out-of-distribution (OOD) datasets. To our knowledge, this is the first unified fine-tuning framework for multi-modal, multi-task, and pan-cancer modeling in digital pathology.
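The label-as-text idea, predicting by matching a slide representation against embeddings of the label names so that semantically related labels across tasks and cancer types share one output space, can be sketched as below. The `toy_text_encoder`, the label strings, and the similarity head are illustrative assumptions; the paper uses a pretrained LLM, not a hash-seeded stand-in.

```python
import zlib
import numpy as np


def toy_text_encoder(text: str, dim: int = 64) -> np.ndarray:
    # Stand-in for an LLM text encoder: each label text maps to a
    # deterministic unit vector (a real LLM would place semantically
    # related labels near each other in this space).
    rng = np.random.default_rng(zlib.crc32(text.encode("utf-8")))
    v = rng.standard_normal(dim)
    return v / np.linalg.norm(v)


# Labels from different tasks/cancer types live in one embedding space,
# so a single prediction head can serve multi-task, pan-cancer training.
labels = [
    "lung adenocarcinoma",
    "lung squamous cell carcinoma",
    "high risk of death within five years",
]
label_embs = np.stack([toy_text_encoder(t) for t in labels])

# A mock slide embedding already projected into the label-text space;
# prediction is similarity to label embeddings, not fixed class logits.
slide_emb = toy_text_encoder("lung adenocarcinoma")  # perfect-match mock
scores = label_embs @ slide_emb
pred = labels[int(np.argmax(scores))]
print(pred)  # "lung adenocarcinoma" by construction
```

Because new labels are just new text, this head extends to unseen tasks or cancer types without changing the output dimensionality, which is what enables a single training recipe across tasks.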