🤖 AI Summary
Evaluating the representational capacity and cross-domain stability of chest X-ray (CXR) foundation models remains challenging due to inconsistent benchmarks and methodologies.
Method: We conduct a systematic, reproducible comparison of CXR-Foundation (ELIXR v2.0) and MedImageInsight on MIMIC-CXR and NIH ChestX-ray14, using standardized preprocessing, a fixed LightGBM classifier, and unified evaluation metrics (AUROC and F1-score, each with a 95% CI), alongside unsupervised clustering to assess disease semantic structure.
Contribution/Results: MedImageInsight achieves marginally higher performance on single-dataset tasks, whereas CXR-Foundation demonstrates superior cross-dataset generalization stability. Clustering of MedImageInsight embeddings aligns closely with the quantitative metrics, indicating that the model learns a consistent anatomical-pathological representation. This work establishes the first reproducible, head-to-head benchmark for CXR foundation models, providing a rigorous methodological framework and reliable baseline for multimodal clinical model integration and evaluation.
📝 Abstract
Recent foundation models have demonstrated strong performance in medical image representation learning, yet their comparative behaviour across datasets remains underexplored. This work benchmarks two large-scale chest X-ray (CXR) embedding models, CXR-Foundation (ELIXR v2.0) and MedImageInsight, on the public MIMIC-CXR and NIH ChestX-ray14 datasets. Each model was evaluated using a unified preprocessing pipeline and fixed downstream classifiers to ensure reproducible comparison. We extracted embeddings directly from pre-trained encoders, trained lightweight LightGBM classifiers on multiple disease labels, and reported mean AUROC and F1-score with 95% confidence intervals. MedImageInsight achieved slightly higher performance across most tasks, while CXR-Foundation exhibited strong cross-dataset stability. Unsupervised clustering of MedImageInsight embeddings further revealed a coherent disease-specific structure consistent with quantitative results. The results highlight the need for standardised evaluation of medical foundation models and establish reproducible baselines for future multimodal and clinical integration studies.
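The abstract reports AUROC and F1-score with 95% confidence intervals but does not say how the intervals were computed. As an illustration only, the sketch below assumes a percentile bootstrap over test samples, one common choice for this kind of evaluation. The function names (`auroc`, `f1_score`, `bootstrap_ci`), the 0.5 decision threshold for F1, and the toy labels and scores are all hypothetical and are not taken from the paper. In practice the scores would be the predicted probabilities of a per-label classifier trained on the frozen embeddings.

```python
import random


def auroc(y_true, y_score):
    """AUROC via the Mann-Whitney U statistic, using average ranks for ties."""
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUROC needs at least one positive and one negative")
    order = sorted(range(len(y_score)), key=lambda i: y_score[i])
    ranks = [0.0] * len(y_score)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and y_score[order[j + 1]] == y_score[order[i]]:
            j += 1
        avg_rank = (i + j) / 2.0 + 1.0  # 1-based average rank of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    rank_sum_pos = sum(r for r, y in zip(ranks, y_true) if y == 1)
    return (rank_sum_pos - n_pos * (n_pos + 1) / 2.0) / (n_pos * n_neg)


def f1_score(y_true, y_pred):
    """Binary F1 from hard predictions; returns 0.0 when undefined."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def bootstrap_ci(y_true, y_score, metric, n_boot=1000, alpha=0.05, seed=0):
    """Point estimate plus percentile-bootstrap (1 - alpha) interval.

    Assumption: samples are resampled with replacement; resamples that
    contain only one class are skipped because AUROC is undefined on them.
    """
    point = metric(y_true, y_score)
    rng = random.Random(seed)
    n = len(y_true)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        yt = [y_true[i] for i in idx]
        if len(set(yt)) < 2:
            continue
        stats.append(metric(yt, [y_score[i] for i in idx]))
    stats.sort()
    lo = stats[int((alpha / 2) * n_boot)]
    hi = stats[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return point, lo, hi


def f1_at_half(y_true, y_score):
    """F1 after thresholding scores at 0.5 (threshold is an assumption)."""
    return f1_score(y_true, [1 if s >= 0.5 else 0 for s in y_score])


# Toy labels and scores for a single disease label (illustrative only).
labels = [0, 0, 1, 1, 0, 1, 0, 1, 1, 0]
scores = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.55, 0.7, 0.45, 0.3]

auc_point, auc_lo, auc_hi = bootstrap_ci(labels, scores, auroc)
f1_point, f1_lo, f1_hi = bootstrap_ci(labels, scores, f1_at_half)
print(f"AUROC {auc_point:.3f} (95% CI {auc_lo:.3f}-{auc_hi:.3f})")
print(f"F1    {f1_point:.3f} (95% CI {f1_lo:.3f}-{f1_hi:.3f})")
```

Running the same routine per disease label and per dataset, with the classifier and seed held fixed, is one way to keep the comparison between the two embedding models like-for-like.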