AI Summary
To address the scarcity of medical image annotations and the underutilization of multi-view structural information, this paper proposes the Multi-View Masked Autoencoder (MVMAE). MVMAE jointly exploits the redundancy across X-ray views and clinical report text via masked image reconstruction and cross-view contrastive learning, enabling view-invariant, disease-relevant representation learning. Its key innovation lies in leveraging multi-view consistency as a self-supervised signal while combining vision-language collaborative training with vision-only inference. Evaluated on MIMIC-CXR, CheXpert, and PadChest, MVMAE consistently outperforms both supervised methods and state-of-the-art vision-language baselines. Notably, its variant MVMAE-V2T achieves superior performance in low-label regimes, demonstrating strong generalization capability and clinical applicability.
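The two training signals described above, masked image reconstruction and cross-view contrastive alignment, can be illustrated with a minimal NumPy sketch. This is not the paper's implementation; the masking ratio, temperature, and embedding shapes are illustrative assumptions, and the symmetric InfoNCE loss stands in for whatever cross-view objective MVMAE actually uses.

```python
import numpy as np

rng = np.random.default_rng(0)

def masked_recon_loss(patches, recon, mask):
    """MAE-style loss: mean squared error computed only on masked patches."""
    per_patch = ((recon - patches) ** 2).mean(axis=-1)  # (B, N)
    return float((per_patch * mask).sum() / mask.sum())

def info_nce(z_a, z_b, tau=0.1):
    """Symmetric InfoNCE pulling paired embeddings from two X-ray views
    (e.g. frontal and lateral) of the same study together."""
    z_a = z_a / np.linalg.norm(z_a, axis=1, keepdims=True)
    z_b = z_b / np.linalg.norm(z_b, axis=1, keepdims=True)
    logits = z_a @ z_b.T / tau          # (B, B): positives on the diagonal
    idx = np.arange(len(z_a))
    def ce(l):
        l = l - l.max(axis=1, keepdims=True)
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[idx, idx].mean()
    return float((ce(logits) + ce(logits.T)) / 2)

# Toy tensors: B studies, N patches of dim D, view embeddings of dim E.
B, N, D, E = 4, 16, 32, 8
patches = rng.normal(size=(B, N, D))
recon = patches + 0.1 * rng.normal(size=(B, N, D))   # imperfect decoder output
mask = (rng.random((B, N)) < 0.75).astype(float)     # assumed 75% masking ratio
z_frontal, z_lateral = rng.normal(size=(B, E)), rng.normal(size=(B, E))

total = masked_recon_loss(patches, recon, mask) + info_nce(z_frontal, z_lateral)
```

In this toy form, the reconstruction term drives the encoder to model anatomy from partial views, while the contrastive term turns clinical redundancy between projections into a supervisory signal, matching the intuition in the summary.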
Abstract
Building robust medical machine learning systems requires pretraining strategies that exploit the intrinsic structure present in clinical data. We introduce the Multi-View Masked Autoencoder (MVMAE), a self-supervised framework that leverages the natural multi-view organization of radiology studies to learn view-invariant and disease-relevant representations. MVMAE combines masked image reconstruction with cross-view alignment, transforming clinical redundancy across projections into a powerful self-supervisory signal. We further extend this approach with MVMAE-V2T, which incorporates radiology reports as an auxiliary text-based learning signal to enhance semantic grounding while preserving fully vision-based inference. Evaluated on a downstream disease classification task on three large-scale public datasets (MIMIC-CXR, CheXpert, and PadChest), MVMAE consistently outperforms supervised and vision-language baselines. Furthermore, MVMAE-V2T provides additional gains, particularly in low-label regimes where structured textual supervision is most beneficial. Together, these results establish structural and textual supervision as complementary paths toward scalable, clinically grounded medical foundation models.
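The MVMAE-V2T idea, text as a training-time auxiliary signal with vision-only inference, can be sketched as follows. All names and shapes here are hypothetical illustrations, not the authors' code: a cosine alignment term stands in for the report-based objective, and a trivial linear classifier stands in for the downstream disease-classification head.

```python
import numpy as np

rng = np.random.default_rng(1)

def text_align_loss(z_img, z_txt):
    """Auxiliary training loss: pull each image embedding toward the
    embedding of its paired radiology report (used only during pretraining)."""
    z_img = z_img / np.linalg.norm(z_img, axis=1, keepdims=True)
    z_txt = z_txt / np.linalg.norm(z_txt, axis=1, keepdims=True)
    return float(1.0 - (z_img * z_txt).sum(axis=1).mean())

class VisionOnlyHead:
    """At inference time only the image pathway is used; no report is needed."""
    def __init__(self, w):
        self.w = w  # (E,) linear probe weights, here random for illustration
    def predict(self, z_img):
        return (z_img @ self.w > 0).astype(int)

B, E = 4, 8
z_img = rng.normal(size=(B, E))
z_txt = z_img + 0.2 * rng.normal(size=(B, E))  # stand-in report embeddings
aux = text_align_loss(z_img, z_txt)            # added to the MVMAE objective
head = VisionOnlyHead(rng.normal(size=E))
preds = head.predict(z_img)                    # no text at test time
```

The design point this mirrors is the asymmetry in the abstract: reports sharpen semantic grounding during pretraining, which is where the low-label gains come from, but deployment stays purely image-based.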