🤖 AI Summary
This work addresses the challenge of incorporating radiologists' prior knowledge into medical self-supervised learning (SSL). The authors propose CheXWorld, the first self-supervised world model tailored to X-ray imaging, designed to approximate radiologists' commonsense reasoning by jointly modeling three dimensions of medical knowledge: local anatomical structures (fine-grained characteristics of tissues, such as architectures, shapes, and textures), global anatomical layout (the overall organization of organs and skeletons), and domain variations (transitions across radiograph appearance domains arising from different hospitals, devices, or patients). Empirically, CheXWorld significantly outperforms state-of-the-art SSL methods and large-scale medical foundation models across eight medical image classification and segmentation benchmarks. Tailored qualitative and quantitative analyses further show that it faithfully captures all three dimensions of medical knowledge, overcoming a key limitation of existing SSL frameworks in modeling medical priors.
📝 Abstract
Humans can develop internal world models that encode commonsense knowledge, telling them how the world works and predicting the consequences of their actions. This concept has recently emerged as a promising direction for building general-purpose machine learning models, e.g., for visual representation learning. In this paper, we present CheXWorld, the first effort towards a self-supervised world model for radiographic images. Specifically, our work develops a unified framework that simultaneously models three aspects of medical knowledge essential for qualified radiologists: 1) local anatomical structures, describing the fine-grained characteristics of local tissues (e.g., architectures, shapes, and textures); 2) global anatomical layouts, describing the global organization of the human body (e.g., layouts of organs and skeletons); and 3) domain variations, which encourage CheXWorld to model the transitions across different appearance domains of radiographs (e.g., varying clarity, contrast, and exposure caused by collecting radiographs from different hospitals, devices, or patients). Empirically, we design tailored qualitative and quantitative analyses, revealing that CheXWorld successfully captures these three dimensions of medical knowledge. Furthermore, transfer learning experiments across eight medical image classification and segmentation benchmarks showcase that CheXWorld significantly outperforms existing SSL methods and large-scale medical foundation models. Code and pre-trained models are available at https://github.com/LeapLabTHU/CheXWorld.
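The abstract describes a world model that predicts one view of a radiograph from another in representation space, conditioned on anatomical or domain information. Below is a minimal PyTorch sketch of that general recipe, not the authors' actual code: the `TinyEncoder`, `Predictor`, module sizes, the EMA-style frozen target encoder, and the view-pairing scheme in the comments are all illustrative assumptions; consult the linked repository for the real implementation.

```python
# A minimal sketch (NOT the authors' code) of a latent-prediction world model
# for radiographs. All names, dimensions, and design choices are assumptions.
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyEncoder(nn.Module):
    """Toy patch encoder standing in for a real backbone (assumption)."""
    def __init__(self, patch=16, dim=128):
        super().__init__()
        # Single input channel: X-rays are grayscale.
        self.proj = nn.Conv2d(1, dim, kernel_size=patch, stride=patch)

    def forward(self, x):                                  # x: (B, 1, H, W)
        return self.proj(x).flatten(2).transpose(1, 2)     # (B, N, dim)

class Predictor(nn.Module):
    """Predicts target latents from context latents plus a conditioning
    token (e.g., a position token for layout, or a domain token for
    appearance transitions) -- the conditioning scheme is an assumption."""
    def __init__(self, dim=128):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(2 * dim, dim), nn.GELU(),
                                 nn.Linear(dim, dim))

    def forward(self, ctx, cond):              # ctx: (B, N, dim), cond: (B, dim)
        cond = cond.unsqueeze(1).expand(-1, ctx.size(1), -1)
        return self.mlp(torch.cat([ctx, cond], dim=-1))

encoder = TinyEncoder()
target_encoder = copy.deepcopy(encoder)        # frozen target branch
for p in target_encoder.parameters():
    p.requires_grad_(False)
predictor = Predictor()

def world_model_loss(view_a, view_b, cond_token):
    """Predict the latents of view_b from view_a in representation space.
    The pairing is a hypothetical mapping of the paper's three dimensions:
    (local crop, local crop) for anatomy, (crop, full image) for layout,
    (augmented, original) for domain transitions."""
    ctx = encoder(view_a)
    with torch.no_grad():
        tgt = target_encoder(view_b)           # no gradients to the target
    pred = predictor(ctx, cond_token)
    return F.smooth_l1_loss(pred, tgt)

# Usage: two 224x224 views of the same radiograph plus a condition token.
xa, xb = torch.randn(2, 1, 224, 224), torch.randn(2, 1, 224, 224)
cond = torch.randn(2, 128)
loss = world_model_loss(xa, xb, cond)
loss.backward()
```

One design note implicit in this recipe: because the loss is computed between latent representations rather than pixels, the model can focus on anatomical and domain-level structure instead of reconstructing low-level image noise, which is the usual motivation for predictive world models over purely generative SSL objectives.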