ModalTune: Fine-Tuning Slide-Level Foundation Models with Multi-Modal Information for Multi-task Learning in Digital Pathology

📅 2025-03-21
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
Digital pathology whole-slide image (WSI) analysis is challenging due to extreme image scale, weak training signals, catastrophic forgetting during fine-tuning, and under-utilization of information shared across tasks and modalities. To address the latter two issues, the authors propose ModalTune, a fine-tuning framework for slide-level foundation models (SLFMs). ModalTune introduces a Modal Adapter that fuses new modalities without modifying SLFM weights, and uses large language models (LLMs) to encode labels as text, capturing semantic relationships between labels so that multiple tasks and cancer types can be trained in a single recipe. ModalTune achieves state-of-the-art results against both uni-modal and multi-modal baselines across four cancer types, jointly improving survival and cancer-subtype prediction while remaining competitive in pan-cancer settings, and generalizes to two out-of-distribution (OOD) datasets. Key contributions: (1) the Modal Adapter, which integrates new modalities into a frozen SLFM; (2) LLM-based semantic label encoding enabling unified multi-task, pan-cancer training; and (3) the first unified fine-tuning framework for multi-modal, multi-task, and pan-cancer modeling in digital pathology.
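The summary above describes the core "fuse new modalities without modifying SLFM weights" idea. The page gives no implementation details, so the following is only a minimal numpy sketch of that pattern under stated assumptions: the backbone, all dimensions, and the residual-fusion form (`ModalAdapter`, `slide_encoder`, `W_m`, `W_s`) are hypothetical names invented for illustration, not the authors' code.

```python
import numpy as np

rng = np.random.default_rng(0)

# Frozen slide-level "foundation model": its weights are never updated.
W_frozen = rng.standard_normal((512, 256))

def slide_encoder(x):
    """Frozen backbone: maps pooled WSI patch features to a slide embedding."""
    return np.tanh(x @ W_frozen)

class ModalAdapter:
    """Illustrative adapter: only these weights would be trained."""
    def __init__(self, slide_dim=256, modal_dim=64, out_dim=256):
        self.W_m = rng.standard_normal((modal_dim, out_dim)) * 0.01
        self.W_s = np.eye(slide_dim, out_dim)

    def __call__(self, slide_emb, modal_feat):
        # Residual-style fusion: the frozen slide embedding passes through,
        # and the adapter adds a learned, modality-conditioned offset.
        return slide_emb @ self.W_s + modal_feat @ self.W_m

x = rng.standard_normal((1, 512))   # pooled WSI patch features (assumed dims)
m = rng.standard_normal((1, 64))    # extra modality features (assumed dims)
adapter = ModalAdapter()
fused = adapter(slide_encoder(x), m)
print(fused.shape)                  # (1, 256)
```

The point of the pattern is that gradients only flow into the adapter, so the SLFM cannot catastrophically forget its pretrained representation.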

๐Ÿ“ Abstract
Prediction tasks in digital pathology are challenging due to the massive size of whole-slide images (WSIs) and the weak nature of training signals. Advances in computing, data availability, and self-supervised learning (SSL) have paved the way for slide-level foundation models (SLFMs) that can improve prediction tasks in low-data regimes. However, working with these models is challenging, with issues such as catastrophic forgetting during fine-tuning and under-utilization of shared information between tasks and modalities. To overcome these two challenges, we propose ModalTune, a novel fine-tuning framework which introduces the Modal Adapter to integrate new modalities without modifying SLFM weights. Additionally, we use large-language models (LLMs) to encode labels as text, capturing semantic relationships and enhancing generalization across multiple tasks and cancer types in a single training recipe. ModalTune achieves state-of-the-art (SOTA) results against both uni-modal and multi-modal models across four cancer types, jointly improving survival and cancer subtype prediction while remaining competitive in pan-cancer settings. Additionally, we show ModalTune is highly generalizable to two out-of-distribution (OOD) datasets. To our knowledge, this is the first unified fine-tuning framework for multi-modal, multi-task, and pan-cancer modeling in digital pathology.
Problem

Research questions and friction points this paper is trying to address.

Overcoming catastrophic forgetting in slide-level foundation model fine-tuning
Enhancing multi-task learning with shared multi-modal information
Improving generalization across cancer types and tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Modal Adapter integrates new modalities without modifying SLFM weights
LLM-encoded text labels capture semantic relationships between tasks and cancer types
Unified framework for multi-modal, multi-task, pan-cancer modeling
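The second bullet, encoding labels as text with an LLM so that semantically related labels stay close, can be sketched as similarity-based classification in label-embedding space. This is a toy illustration only: the hash-based `embed_text` below is a deterministic stand-in for a real LLM text encoder, and the label set is a made-up example.

```python
import zlib
import numpy as np

def embed_text(s, dim=64):
    """Toy stand-in for an LLM text encoder: a deterministic, hash-seeded
    bag-of-words embedding. Labels sharing tokens get correlated vectors."""
    v = np.zeros(dim)
    for tok in s.lower().split():
        tok_rng = np.random.default_rng(zlib.crc32(tok.encode()))
        v += tok_rng.standard_normal(dim)
    return v / (np.linalg.norm(v) + 1e-8)

labels = [
    "lung adenocarcinoma",
    "lung squamous cell carcinoma",
    "invasive ductal carcinoma",
]
label_embs = np.stack([embed_text(l) for l in labels])

def classify(slide_emb):
    # Predict the label whose text embedding is most similar (cosine);
    # because labels are embedded as text, related labels (e.g. both lung
    # subtypes) occupy nearby regions, which is what enables training
    # many tasks and cancer types against one shared output space.
    sims = label_embs @ slide_emb / (np.linalg.norm(slide_emb) + 1e-8)
    return labels[int(np.argmax(sims))]

query = embed_text("lung adenocarcinoma")
print(classify(query))  # lung adenocarcinoma
```

In the actual framework the slide-side embedding would come from the adapted SLFM rather than from text; the shared trick is that predictions live in a semantic label space instead of a fixed per-task softmax head.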