Benchmarking CXR Foundation Models With Publicly Available MIMIC-CXR and NIH-CXR14 Datasets

📅 2025-12-03

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Evaluating the representational capacity and cross-domain stability of chest X-ray (CXR) foundation models remains challenging because benchmarks and methodologies are inconsistent. Method: We conduct a systematic, reproducible comparison of CXR-Foundation (ELIXR v2.0) and MedImageInsight on MIMIC-CXR and NIH ChestX-ray14, using standardized preprocessing, a fixed LightGBM classifier, and unified evaluation metrics (AUROC and F1-score, each with a 95% CI), alongside unsupervised clustering to assess disease semantic structure. Contribution/Results: MedImageInsight achieves marginally higher performance on most tasks, whereas CXR-Foundation demonstrates strong cross-dataset stability. Clustering of MedImageInsight embeddings reveals a coherent disease-specific structure that is consistent with the quantitative metrics. This work establishes a reproducible, side-by-side benchmark for CXR foundation models, providing a methodological framework and baseline for multimodal clinical model integration and evaluation.

📝 Abstract

Recent foundation models have demonstrated strong performance in medical image representation learning, yet their comparative behaviour across datasets remains underexplored. This work benchmarks two large-scale chest X-ray (CXR) embedding models, CXR-Foundation (ELIXR v2.0) and MedImageInsight, on the public MIMIC-CXR and NIH ChestX-ray14 datasets. Each model was evaluated using a unified preprocessing pipeline and fixed downstream classifiers to ensure reproducible comparison. We extracted embeddings directly from pre-trained encoders, trained lightweight LightGBM classifiers on multiple disease labels, and reported mean AUROC and F1-score with 95% confidence intervals. MedImageInsight achieved slightly higher performance across most tasks, while CXR-Foundation exhibited strong cross-dataset stability. Unsupervised clustering of MedImageInsight embeddings further revealed a coherent disease-specific structure consistent with quantitative results. The results highlight the need for standardised evaluation of medical foundation models and establish reproducible baselines for future multimodal and clinical integration studies.
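The abstract reports AUROC and F1-score with 95% confidence intervals but does not say how the intervals were computed. As an illustration only, the sketch below uses a percentile bootstrap, which is one common choice and an assumption here, not the authors' confirmed procedure. The function names (`auroc`, `f1_score`, `bootstrap_ci`) and the toy labels are hypothetical, and only the Python standard library is used.

```python
import random


def auroc(y_true, y_score):
    """AUROC via the rank-sum (Mann-Whitney U) formulation, averaging tied ranks."""
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUROC needs both classes present")
    pairs = sorted(zip(y_score, y_true))
    rank_sum_pos = 0.0
    i = 0
    while i < len(pairs):
        j = i
        while j < len(pairs) and pairs[j][0] == pairs[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2.0  # mean of the 1-based ranks i+1 .. j
        rank_sum_pos += avg_rank * sum(label for _, label in pairs[i:j])
        i = j
    u = rank_sum_pos - n_pos * (n_pos + 1) / 2.0
    return u / (n_pos * n_neg)


def f1_score(y_true, y_pred):
    """Binary F1 from hard 0/1 predictions; returns 0.0 when undefined."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def bootstrap_ci(metric, y_true, y_other, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI (assumed method): returns (point, lower, upper)."""
    if len(set(y_true)) < 2:
        raise ValueError("need both classes to resample")
    rng = random.Random(seed)
    n = len(y_true)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        yt = [y_true[i] for i in idx]
        if len(set(yt)) < 2:
            continue  # skip resamples that lost one class
        stats.append(metric(yt, [y_other[i] for i in idx]))
    stats.sort()
    lower = stats[round((alpha / 2) * (n_boot - 1))]
    upper = stats[round((1 - alpha / 2) * (n_boot - 1))]
    return metric(y_true, y_other), lower, upper


# Toy example for a single disease label (made-up values, not paper data).
labels = [0, 0, 1, 1, 0, 1, 0, 1, 1, 0]
scores = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.3, 0.7, 0.6, 0.5]
point, lower, upper = bootstrap_ci(auroc, labels, scores, n_boot=500)
```

In a multi-label setting such as the one described, this would be repeated per disease label and the per-label AUROCs averaged to obtain a mean AUROC; F1 additionally requires choosing a decision threshold for the classifier scores, which the abstract does not specify.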

Problem

Research questions and friction points this paper is trying to address.

Benchmark chest X-ray foundation models on public datasets

Evaluate model performance using standardized metrics and classifiers

Compare cross-dataset stability and disease-specific embedding structures

Innovation

Methods, ideas, or system contributions that make the work stand out.

Benchmarked two CXR embedding models on public datasets

Used unified preprocessing and fixed downstream classifiers

Evaluated embeddings with LightGBM classifiers and clustering
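The summary does not name the clustering algorithm or the measure used to judge disease-specific structure. As a rough illustration, the sketch below assumes k-means with a deterministic farthest-first initialisation and scores the result with cluster purity against known labels. Both choices, the helper names (`sq_dist`, `kmeans`, `purity`), and the toy 2-D "embeddings" are assumptions for demonstration, not the authors' setup.

```python
from collections import Counter


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, n_iter=100):
    """Plain k-means with farthest-first initialisation; returns cluster ids."""
    centroids = [list(points[0])]
    while len(centroids) < k:
        # Pick the point farthest from its nearest existing centroid.
        farthest = max(points, key=lambda p: min(sq_dist(p, c) for c in centroids))
        centroids.append(list(farthest))
    assign = None
    for _ in range(n_iter):
        new_assign = [
            min(range(k), key=lambda c: sq_dist(p, centroids[c])) for p in points
        ]
        if new_assign == assign:
            break  # assignments stopped changing
        assign = new_assign
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assign


def purity(assign, labels):
    """Fraction of samples whose label matches the majority label of their cluster."""
    total = 0
    for c in set(assign):
        cluster_labels = [lab for a, lab in zip(assign, labels) if a == c]
        total += Counter(cluster_labels).most_common(1)[0][1]
    return total / len(labels)


# Toy stand-in for image embeddings: two well-separated groups (made-up data).
embeddings = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (11, 11)]
disease = ["normal"] * 4 + ["effusion"] * 4
clusters = kmeans(embeddings, k=2)
score = purity(clusters, disease)
```

A purity close to 1.0 would indicate that clusters line up with disease labels, while a value near the majority-class frequency would indicate little label-related structure. Real embeddings are high-dimensional, so a library implementation would normally replace this hand-written loop.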

🔎 Similar Papers

Leveraging Foundation Models for Content-Based Medical Image Retrieval in Radiology