AI Summary
Chest X-ray (CXR) imaging suffers from severe anatomical superposition due to its inherent 2D projection, limiting diagnostic accuracy and risk prediction. To address this, we propose the first self-supervised "world model" for sequences of CXR projections, jointly leveraging a visual encoder and a state-transition module to implicitly model dynamic changes of the thoracic volume across projection angles and to learn 3D-anatomically grounded latent representations from multi-angle CXRs. Crucially, our method requires no 3D annotations, yet it recovers interpretable volumetric information and enables cross-view representation alignment. On cardiovascular risk prediction it significantly outperforms both supervised and existing self-supervised baselines, and it achieves competitive performance in classifying five common pathologies. Moreover, its latent representations enable high-fidelity reconstruction of volumetric context, establishing a novel paradigm for 3D semantic understanding of CXRs.
Abstract
Chest X-rays (CXRs) are the most widely used medical imaging modality and play a pivotal role in disease diagnosis. However, as 2D projection images, CXRs are limited by structural superposition, which constrains their effectiveness in precise disease diagnosis and risk prediction. To address this limitation, this study introduces Xray2Xray, a novel world model that learns latent representations encoding 3D structural information from chest X-rays. Xray2Xray captures latent representations of the chest volume by modeling the transition dynamics of X-ray projections across different angular positions with a vision model and a transition model. We employed the latent representations of Xray2Xray for downstream risk prediction and disease diagnosis tasks. Experimental results showed that Xray2Xray outperformed both supervised methods and self-supervised pretraining methods for cardiovascular disease risk estimation, and achieved competitive performance in classifying five pathologies in CXRs. We also assessed the quality of Xray2Xray's latent representations through synthesis tasks and demonstrated that they can be used to reconstruct volumetric context.
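The abstract describes the core recipe at a high level: a vision model encodes each angular projection into a latent state, and a transition model predicts how that latent evolves as the projection angle advances. The paper's actual architecture is not given here, so the following is only a toy NumPy sketch of that idea under stated assumptions: a fixed random linear "encoder", a least-squares-fitted linear transition z_{t+1} ≈ A z_t + b, and synthetic rolled signals standing in for multi-angle projections. All names (`encode`, `fit_transition`, dimensions) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (assumptions, not from the paper): flattened
# "projection" size and latent state size.
IMG_DIM, LATENT_DIM = 64, 8

# Vision model stand-in: a fixed random linear map from a projection
# x (IMG_DIM,) to a latent state z (LATENT_DIM,).
W_enc = rng.normal(scale=0.1, size=(LATENT_DIM, IMG_DIM))

def encode(x):
    return W_enc @ x

def fit_transition(Z):
    """Fit a linear transition z_{t+1} = A z_t + b by least squares.

    Z: (T, LATENT_DIM) sequence of latents ordered by projection angle.
    """
    Z_in, Z_out = Z[:-1], Z[1:]
    X = np.hstack([Z_in, np.ones((len(Z_in), 1))])  # append bias column
    coef, *_ = np.linalg.lstsq(X, Z_out, rcond=None)
    A, b = coef[:-1].T, coef[-1]
    return A, b

# Synthetic "multi-angle" sequence: a fixed pattern shifted smoothly
# with angle, a crude stand-in for projections of one chest volume.
T = 32
angles = np.linspace(0.0, np.pi, T)
base = rng.normal(size=IMG_DIM)
views = np.stack([np.roll(base, int(8 * np.sin(a))) for a in angles])

# Encode every view, fit the transition model on the latent sequence.
Z = np.stack([encode(v) for v in views])
A, b = fit_transition(Z)

# World-model consistency check: roll the transition forward from the
# first latent and compare against the encoder's latents at each angle.
z_pred = [Z[0]]
for _ in range(T - 1):
    z_pred.append(A @ z_pred[-1] + b)
z_pred = np.stack(z_pred)

err = float(np.mean((z_pred - Z) ** 2))
print(f"mean latent rollout error: {err:.4f}")
```

In the real method, the rollout error above would be a training signal rather than a diagnostic, and the encoder and transition model would be learned jointly (self-supervised, with no 3D annotations); the downstream tasks then read off the learned latents.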