AI Summary
To address the scarcity of medical image annotations and the underutilization of multi-view structural information, this paper proposes the Multi-View Masked Autoencoder (MVMAE). MVMAE jointly exploits the redundancy across X-ray views and clinical report text via masked image reconstruction and cross-view contrastive learning, enabling view-invariant, disease-relevant representation learning. Its key innovation lies in leveraging multi-view consistency as a self-supervised signal while combining vision-language collaborative training with vision-only inference. Evaluated on MIMIC-CXR, CheXpert, and PadChest, MVMAE consistently outperforms both supervised methods and state-of-the-art vision-language baselines. Notably, its variant MVMAE-V2T achieves superior performance in low-label regimes, demonstrating strong generalization capability and clinical applicability.
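The two training signals described above, masked image reconstruction and cross-view contrastive alignment, can be illustrated with a minimal NumPy sketch. This is not the paper's implementation; the masking ratio, temperature, and embedding shapes are illustrative assumptions, and the symmetric InfoNCE loss stands in for whatever cross-view objective MVMAE actually uses.

```python
import numpy as np

rng = np.random.default_rng(0)

def masked_recon_loss(patches, recon, mask):
    """MAE-style loss: mean squared error computed only on masked patches."""
    per_patch = ((recon - patches) ** 2).mean(axis=-1)  # (B, N)
    return float((per_patch * mask).sum() / mask.sum())

def info_nce(z_a, z_b, tau=0.1):
    """Symmetric InfoNCE pulling paired embeddings from two X-ray views
    (e.g. frontal and lateral) of the same study together."""
    z_a = z_a / np.linalg.norm(z_a, axis=1, keepdims=True)
    z_b = z_b / np.linalg.norm(z_b, axis=1, keepdims=True)
    logits = z_a @ z_b.T / tau          # (B, B): positives on the diagonal
    idx = np.arange(len(z_a))
    def ce(l):
        l = l - l.max(axis=1, keepdims=True)
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[idx, idx].mean()
    return float((ce(logits) + ce(logits.T)) / 2)

# Toy tensors: B studies, N patches of dim D, view embeddings of dim E.
B, N, D, E = 4, 16, 32, 8
patches = rng.normal(size=(B, N, D))
recon = patches + 0.1 * rng.normal(size=(B, N, D))   # imperfect decoder output
mask = (rng.random((B, N)) < 0.75).astype(float)     # assumed 75% masking ratio
z_frontal, z_lateral = rng.normal(size=(B, E)), rng.normal(size=(B, E))

total = masked_recon_loss(patches, recon, mask) + info_nce(z_frontal, z_lateral)
```

In this toy form, the reconstruction term drives the encoder to model anatomy from partial views, while the contrastive term turns clinical redundancy between projections into a supervisory signal, matching the intuition in the summary.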
Abstract
Building robust medical machine learning systems requires pretraining strategies that exploit the intrinsic structure present in clinical data. We introduce the Multi-View Masked Autoencoder (MVMAE), a self-supervised framework that leverages the natural multi-view organization of radiology studies to learn view-invariant and disease-relevant representations. MVMAE combines masked image reconstruction with cross-view alignment, transforming clinical redundancy across projections into a powerful self-supervisory signal. We further extend this approach with MVMAE-V2T, which incorporates radiology reports as an auxiliary text-based learning signal to enhance semantic grounding while preserving fully vision-based inference. Evaluated on a downstream disease classification task on three large-scale public datasets (MIMIC-CXR, CheXpert, and PadChest), MVMAE consistently outperforms supervised and vision-language baselines. Furthermore, MVMAE-V2T provides additional gains, particularly in low-label regimes where structured textual supervision is most beneficial. Together, these results establish structural and textual supervision as complementary paths toward scalable, clinically grounded medical foundation models.
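The MVMAE-V2T idea, text as a training-time auxiliary signal with vision-only inference, can be sketched as follows. All names and shapes here are hypothetical illustrations, not the authors' code: a cosine alignment term stands in for the report-based objective, and a trivial linear classifier stands in for the downstream disease-classification head.

```python
import numpy as np

rng = np.random.default_rng(1)

def text_align_loss(z_img, z_txt):
    """Auxiliary training loss: pull each image embedding toward the
    embedding of its paired radiology report (used only during pretraining)."""
    z_img = z_img / np.linalg.norm(z_img, axis=1, keepdims=True)
    z_txt = z_txt / np.linalg.norm(z_txt, axis=1, keepdims=True)
    return float(1.0 - (z_img * z_txt).sum(axis=1).mean())

class VisionOnlyHead:
    """At inference time only the image pathway is used; no report is needed."""
    def __init__(self, w):
        self.w = w  # (E,) linear probe weights, here random for illustration
    def predict(self, z_img):
        return (z_img @ self.w > 0).astype(int)

B, E = 4, 8
z_img = rng.normal(size=(B, E))
z_txt = z_img + 0.2 * rng.normal(size=(B, E))  # stand-in report embeddings
aux = text_align_loss(z_img, z_txt)            # added to the MVMAE objective
head = VisionOnlyHead(rng.normal(size=E))
preds = head.predict(z_img)                    # no text at test time
```

The design point this mirrors is the asymmetry in the abstract: reports sharpen semantic grounding during pretraining, which is where the low-label gains come from, but deployment stays purely image-based.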