On the effectiveness of multimodal privileged knowledge distillation in two vision transformer based diagnostic applications

📅 2025-08-06
📈 Citations: 0
Influential: 0
🤖 AI Summary
Clinical deployment of deep learning models often encounters missing multimodal data (e.g., text, structured reports) at inference time. To address this, we propose Multimodal Privileged Knowledge Distillation (MMPKD), a framework that leverages teacher-accessible textual and structured data, available only at training time, to guide a unimodal vision transformer student model toward improved lesion localization in chest X-ray and mammography diagnosis. Our work provides the first empirical evidence that cross-modal distillation significantly enhances attention-map localization accuracy; however, this gain does not generalize across domains, revising prior assumptions about its transferability. Experiments demonstrate that MMPKD improves the interpretability and diagnostic robustness of unimodal models in zero-shot settings, while also revealing its strong context dependence, a critical limitation for real-world clinical adoption.

📝 Abstract
Deploying deep learning models in clinical practice often requires leveraging multiple data modalities, such as images, text, and structured data, to achieve robust and trustworthy decisions. However, not all modalities are always available at inference time. In this work, we propose multimodal privileged knowledge distillation (MMPKD), a training strategy that utilizes additional modalities available solely during training to guide a unimodal vision model. Specifically, we used a text-based teacher model for chest radiographs (MIMIC-CXR) and a tabular metadata-based teacher model for mammography (CBIS-DDSM) to distill knowledge into a vision transformer student model. We show that MMPKD can improve the resulting attention maps' zero-shot capability of localizing regions of interest (ROIs) in input images, while this effect does not generalize across domains, contrary to what prior research suggested.
Problem

Research questions and friction points this paper is trying to address.

Leveraging multiple data modalities for robust clinical decisions
Addressing missing modalities at inference time
Improving attention maps for ROI localization in images
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multimodal privileged knowledge distillation for training
Text-based teacher model for chest radiographs
Tabular metadata teacher model for mammography
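The paper does not release code in this summary, but the core idea of privileged knowledge distillation can be sketched with a standard soft-label distillation loss: a teacher trained with privileged modalities (text or tabular metadata) supervises a vision-only student by softening both models' logits with a temperature and penalizing their KL divergence alongside the usual label loss. The function names, the temperature `T`, and the mixing weight `alpha` below are illustrative assumptions, not the authors' implementation:

```python
import numpy as np

def softmax(z, T=1.0):
    """Temperature-scaled softmax over the last axis."""
    z = np.asarray(z, dtype=float) / T
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def privileged_distillation_loss(student_logits, teacher_logits, labels,
                                 T=2.0, alpha=0.5):
    """Hypothetical MMPKD-style objective: supervised cross-entropy on the
    student plus KL(teacher || student) over temperature-softened logits.
    The teacher's logits come from a model that saw privileged modalities
    (radiology text or tabular metadata) during training only."""
    p_t = softmax(teacher_logits, T)
    p_s = softmax(student_logits, T)
    # KL term, scaled by T^2 as is conventional in distillation
    kd = (p_t * (np.log(p_t + 1e-12) - np.log(p_s + 1e-12))).sum(-1).mean() * T**2
    # standard cross-entropy against the hard labels
    ce = -np.log(softmax(student_logits)[np.arange(len(labels)), labels] + 1e-12).mean()
    return alpha * ce + (1 - alpha) * kd
```

When teacher and student agree, the KL term vanishes and only the supervised loss remains; the further the student drifts from the privileged teacher's soft predictions, the larger the distillation penalty.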
Simon Baur
Fraunhofer Heinrich-Hertz-Institut, 10587 Berlin, Germany
Alexandra Benova
Fraunhofer Heinrich-Hertz-Institut, 10587 Berlin, Germany; Universität Osnabrück, 49074 Osnabrück, Germany
Emilio Dolgener Cantú
Fraunhofer Heinrich-Hertz-Institut, 10587 Berlin, Germany
Jackie Ma
Fraunhofer HHI