AI Summary
This study addresses the prohibitive computational cost of benchmarking large-scale fMRI functional connectivity models, which arises from the combinatorial explosion of model-data configurations and hinders routine evaluation. To overcome this challenge, the work formalizes core-set selection as an order-preserving subset selection problem and introduces a structure-aware subset selection strategy that jointly optimizes structural stability and distributional diversity to construct a small yet representative data subset capable of preserving the true model performance ranking. The proposed method leverages a self-supervised, structure-aware contrastive learning framework (SCLCS), integrating an adaptive Transformer, a structural perturbation scoring (SPS) mechanism, and density-balanced sampling. Evaluated on the REST-meta-MDD dataset, the approach accurately maintains the ground-truth model ranking using only 10% of the data, achieving up to a 23.2% improvement in nDCG@k over the current state-of-the-art.
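To make the ranking-consistency metric concrete, the following is a minimal sketch of how nDCG@k could be computed for a set of models scored on the full dataset and on a core-set. It assumes that the relevance of each model is its non-negative full-data score; the function names `dcg_at_k` and `ndcg_at_k` are illustrative and are not taken from the SCLCS code, whose exact relevance definition may differ.

```python
import numpy as np

def dcg_at_k(relevance, k):
    """Discounted cumulative gain over the first k entries of a ranked list."""
    rel = np.asarray(relevance, dtype=float)[:k]
    # Position i (1-based) is discounted by log2(i + 1).
    discounts = np.log2(np.arange(2, rel.size + 2))
    return float(np.sum(rel / discounts))

def ndcg_at_k(full_scores, subset_scores, k):
    """nDCG@k of the model ordering induced by subset_scores.

    Assumption: relevance of a model is its (non-negative) full-data score.
    """
    full = np.asarray(full_scores, dtype=float)
    sub = np.asarray(subset_scores, dtype=float)
    # Rank models by their core-set score, best first.
    order = np.argsort(-sub, kind="stable")
    # The ideal ordering ranks models by their full-data score.
    ideal = np.sort(full)[::-1]
    idcg = dcg_at_k(ideal, k)
    if idcg == 0.0:
        return 0.0
    return dcg_at_k(full[order], k) / idcg
```

A value of 1 means the top-k ordering obtained on the core-set reproduces the full-data ordering; lower values indicate that strong models were displaced from the top positions.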
Abstract
Benchmarking the hundreds of functional connectivity (FC) modeling methods on large-scale fMRI datasets is critical for reproducible neuroscience. However, the combinatorial explosion of model-data pairings makes exhaustive evaluation computationally prohibitive, preventing such assessments from becoming a routine pre-analysis step. To break this bottleneck, we reframe FC benchmarking as the selection of a small, representative core-set whose sole purpose is to preserve the relative performance ranking of FC operators. We formalize this as a ranking-preserving subset selection problem and propose Structure-aware Contrastive Learning for Core-set Selection (SCLCS), a self-supervised framework to select these core-sets. SCLCS first uses an adaptive Transformer to learn each sample's unique FC structure. It then introduces a novel Structural Perturbation Score (SPS) to quantify the stability of these learned structures during training, identifying samples that represent foundational connectivity archetypes. Finally, because selecting stable samples by top-k ranking alone can neglect diversity, we introduce a density-balanced sampling strategy as a necessary correction, ensuring the final core-set is both structurally robust and distributionally representative. On the large-scale REST-meta-MDD dataset, SCLCS preserves the ground-truth model ranking with just 10% of the data, outperforming state-of-the-art (SOTA) core-set selection methods by up to 23.2% in ranking consistency (nDCG@k). To our knowledge, this is the first work to formalize core-set selection for FC operator benchmarking, thereby making large-scale operator comparisons a feasible and integral part of computational neuroscience. Code is publicly available at https://github.com/lzhan94swu/SCLCS
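The abstract does not specify how density-balanced sampling is implemented. As one hypothetical illustration of why a pure top-k selection on stability scores may need such a correction, the sketch below bins samples by a k-nearest-neighbour sparsity estimate and draws the most stable samples from each bin in round-robin order. The function `density_balanced_select`, its parameters, and the binning scheme are all assumptions made for illustration and are not the SCLCS procedure.

```python
import numpy as np

def density_balanced_select(embeddings, stability, budget, n_bins=4, n_neighbors=5):
    """Pick `budget` sample indices, spreading picks across density bins.

    Hypothetical sketch: assumes at least two samples and Euclidean embeddings.
    """
    X = np.asarray(embeddings, dtype=float)
    s = np.asarray(stability, dtype=float)
    n = X.shape[0]

    # Pairwise Euclidean distances between all samples.
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))

    # Mean distance to the k nearest neighbours (column 0 is the sample itself).
    k = min(n_neighbors, n - 1)
    sparsity = np.sort(dist, axis=1)[:, 1:k + 1].mean(axis=1)

    # Assign each sample to an equal-sized bin by its sparsity rank.
    ranks = np.argsort(np.argsort(sparsity, kind="stable"), kind="stable")
    bins = (ranks * n_bins) // n

    # Within each bin, queue samples from most to least stable.
    queues = []
    for b in range(n_bins):
        members = np.where(bins == b)[0]
        ordered = members[np.argsort(-s[members], kind="stable")]
        queues.append([int(i) for i in ordered])

    # Round-robin over bins so dense regions cannot consume the whole budget.
    selected = []
    while len(selected) < budget and any(queues):
        for queue in queues:
            if queue and len(selected) < budget:
                selected.append(queue.pop(0))
    return selected
```

With a tight cluster of high-stability samples and a few isolated lower-stability ones, plain top-k would select only from the cluster, whereas this scheme reserves part of the budget for the sparse region.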