🤖 AI Summary
This work addresses the challenges of redundancy, class imbalance, and computational inefficiency commonly encountered in medical foundation models due to their reliance on large-scale pretraining data. The authors propose CheXficient, an active learning–driven data curation strategy that significantly enhances generalization on rare pathologies while using only 22.7% of the chest X-ray–report pairs and less than 27.3% of the computational budget during pretraining. Integrated within a vision–language pretraining framework, CheXficient supports zero-shot classification, cross-modal retrieval, and diverse downstream tasks. Evaluated across 20 benchmarks spanning five task categories, CheXficient matches or surpasses models trained on full datasets, demonstrating particularly strong performance in long-tailed and rare disease scenarios.
📝 Abstract
Foundation models for medical imaging are typically pretrained on increasingly large datasets, following a "scale-at-all-costs" paradigm. However, this strategy faces two critical challenges: large-scale medical datasets often contain substantial redundancy and severe class imbalance that bias representation learning toward over-represented patterns, and training indiscriminately on data of heterogeneous quality incurs considerable computational inefficiency. Here we demonstrate that active, principled data curation during pretraining can serve as a viable, cost-effective alternative to brute-force dataset enlargement. We introduce CheXficient, a chest X-ray (CXR) foundation model that selectively prioritizes informative training samples. CheXficient is pretrained on only 22.7% of 1,235,004 paired CXR images and reports while consuming under 27.3% of the total compute budget, yet achieves comparable or superior performance to its full-data counterpart and other large-scale pretrained models. We assess CheXficient across 20 individual benchmarks spanning five task types, including non-adapted off-the-shelf evaluations (zero-shot findings classification and cross-modal retrieval) and adapted downstream tasks (disease prediction, semantic segmentation, and radiology report generation). Further analyses show that CheXficient systematically prioritizes under-represented training samples, improving generalizability on long-tailed or rare conditions. Overall, our work offers practical insights into the data and computation demands for efficient pretraining and downstream adaptation of medical vision-language foundation models.
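The abstract does not spell out the curation algorithm, but the general idea of active-learning-style sample prioritization can be sketched as follows. This is a hypothetical illustration only, not the authors' actual method: it ranks a candidate pool by the prediction entropy of a proxy model and keeps the most uncertain fraction matching the paper's 22.7% data budget. The function names (`entropy`, `select_informative`) and the use of random scores in place of real model outputs are assumptions for the sketch.

```python
import math
import random

def entropy(p):
    """Binary prediction entropy: highest near p = 0.5, i.e. most uncertain."""
    p = min(max(p, 1e-9), 1 - 1e-9)  # clamp to avoid log(0)
    return -(p * math.log(p) + (1 - p) * math.log(1 - p))

def select_informative(samples, budget_fraction):
    """Keep the most uncertain fraction of a candidate pool.

    `samples` is a list of (sample_id, predicted_probability) pairs from a
    proxy model; a real pipeline would typically also fold in class-frequency
    weights so that under-represented findings are favored.
    """
    ranked = sorted(samples, key=lambda s: entropy(s[1]), reverse=True)
    k = max(1, int(len(ranked) * budget_fraction))
    return [sample_id for sample_id, _ in ranked[:k]]

# Toy pool of 1,000 samples with random stand-in probabilities.
random.seed(0)
pool = [(i, random.random()) for i in range(1000)]
subset = select_informative(pool, 0.227)  # keep ~22.7% of the pool
print(len(subset))  # 227
```

The same selection-then-pretrain loop can be iterated: pretrain briefly, re-score the remaining pool, and re-select, so the curated subset tracks what the current model finds informative.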