🤖 AI Summary
This study investigates how data quality heterogeneity in neuroimaging affects self-supervised pretraining: whether low-quality scans (e.g., motion artifacts, signal dropout) provide useful supervisory signals or impair representation learning. We propose a hierarchical quality-aware contrastive pretraining framework and systematically evaluate pretraining efficacy across multi-level quality tiers of brain MRI data, followed by fine-tuning on external cohorts to assess generalizability. Results show that high-quality data substantially improves downstream brain age prediction accuracy (reducing MAE by 12.7%), whereas incorporating low-quality samples degrades representation robustness. Crucially, we uncover fundamental differences between clinical neuroimaging and general computer vision regarding noise tolerance and domain transfer mechanisms. To our knowledge, this is the first work to quantitatively demonstrate that domain-adapted data curation is essential for building trustworthy foundation models in neuroimaging—providing both theoretical grounding and practical guidelines for medical AI pretraining paradigms.
📝 Abstract
Large-scale brain imaging datasets provide unprecedented opportunities for developing domain foundation models through pretraining. However, unlike natural image datasets in computer vision, these neuroimaging data often exhibit high heterogeneity in quality, ranging from well-structured scans to severely distorted or incomplete brain volumes. This raises a fundamental question: can noise or low-quality scans contribute meaningfully to pretraining, or do they instead hinder model learning? In this study, we systematically explore the role of data quality level in pretraining and its impact on downstream tasks. Specifically, we perform pretraining on datasets with different quality levels and perform fine-tuning for brain age prediction on external cohorts. Our results show significant performance differences across quality levels, revealing both opportunities and limitations. We further discuss the gap between computer vision practices and clinical neuroimaging standards, emphasizing the necessity of domain-aware curation to ensure trusted and generalizable domain-specific foundation models.