🤖 AI Summary
This work proposes RadJEPA, a self-supervised representation learning method for chest X-rays that requires no image–text paired data. RadJEPA introduces the Joint Embedding Predictive Architecture (JEPA) to medical imaging for the first time, replacing conventional global-representation alignment with prediction of masked regions' latent representations. By explicitly capturing local semantic structure and eliminating reliance on language-based supervision, the method achieves substantial gains over existing approaches such as Rad-DINO. Evaluated across multiple downstream tasks—including disease classification, semantic segmentation, and radiology report generation—RadJEPA sets a new state of the art, demonstrating its effectiveness at learning rich, transferable representations from unlabeled medical images.
📝 Abstract
Recent advances in medical vision–language models use language to guide the learning of visual representations; however, this form of supervision is constrained by the availability of paired image–text data, raising the question of whether robust radiology encoders can be learned without relying on language supervision. In this work, we introduce RadJEPA, a self-supervised framework built on a Joint Embedding Predictive Architecture that learns without language supervision. Pre-trained solely on unlabeled chest X-ray images, the model learns to predict latent representations of masked image regions. This predictive objective differs fundamentally from both image–text pre-training and DINO-style self-distillation: rather than aligning global representations across views or modalities, RadJEPA explicitly models latent-space prediction. We evaluate the learned encoder on disease classification, semantic segmentation, and report generation tasks. Across benchmarks, RadJEPA outperforms state-of-the-art approaches, including Rad-DINO.
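The masked-latent-prediction objective described above can be sketched in a few lines. The following toy example is an illustrative assumption, not the paper's implementation: the encoders are stand-in random linear maps rather than vision transformers, and the context/target block sampling of JEPA is reduced to a simple boolean patch mask.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions (assumptions, not taken from the paper).
num_patches, patch_dim, latent_dim = 16, 32, 8

# Toy "encoders": random linear maps standing in for ViT encoders.
W_context = rng.normal(size=(patch_dim, latent_dim))
W_target = W_context.copy()        # target encoder starts as a copy
W_predictor = rng.normal(size=(latent_dim, latent_dim))

patches = rng.normal(size=(num_patches, patch_dim))  # one image as patches
mask = rng.random(num_patches) < 0.5                 # patches to predict

# Encode all patches with both encoders; in practice the target branch
# receives no gradients, and the context branch sees only visible patches.
context_latents = patches @ W_context
target_latents = patches @ W_target

# Predict the latent representations of the masked patches and compare
# them to the target encoder's latents in latent space (L2 loss).
pred = context_latents[mask] @ W_predictor
loss = np.mean((pred - target_latents[mask]) ** 2)

# The target encoder is updated as an exponential moving average of the
# context encoder rather than by gradient descent (standard JEPA practice).
ema_decay = 0.996
W_target = ema_decay * W_target + (1 - ema_decay) * W_context
```

The key design choice this sketch highlights is that the loss is computed between latent vectors, not pixels: the model is never asked to reconstruct the masked X-ray regions themselves, only their representations under a slowly moving target encoder.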