🤖 AI Summary
The LUMIR challenge reports that deep learning registration models trained on T1-weighted MRI generalize zero-shot to unseen contrasts and resolutions, a claim at odds with the established understanding of domain shift. Method: We independently re-evaluate these claims across contrasts (T2, T2*, FLAIR), resolutions (including 0.6 mm isotropic), and species (human, macaque). Contribution/Results: Deep learning models match iterative methods on T1w and macaque data, but performance degrades significantly on T2, T2*, and FLAIR contrasts (Cohen's d = 0.7–1.5), and the models fail to run on 0.6 mm isotropic scans, where iterative methods instead benefit from the added resolution. The deep models are also highly sensitive to preprocessing choices. Our findings indicate that current claims of zero-shot "universal superiority" are substantially overestimated. We propose a clinically realistic evaluation paradigm grounded in real-world data distributions and emphasize the necessity of standardized error metrics and effect-size analysis for rigorous benchmarking.
📝 Abstract
The LUMIR challenge represents an important benchmark for evaluating deformable image registration methods on large-scale neuroimaging data. While the challenge demonstrates that modern deep learning methods achieve competitive accuracy on T1-weighted MRI, it also claims exceptional zero-shot generalization to unseen contrasts and resolutions, assertions that contradict the established understanding of domain shift in deep learning. In this paper, we perform an independent re-evaluation of these zero-shot claims using rigorous evaluation protocols while addressing potential sources of instrumentation bias. Our findings reveal a more nuanced picture: (1) deep learning methods perform comparably to iterative optimization on in-distribution T1w images and even on a human-adjacent species (macaque), demonstrating improved task understanding; (2) however, performance degrades significantly on out-of-distribution contrasts (T2, T2*, FLAIR), with Cohen's d values ranging from 0.7 to 1.5, indicating substantial practical impact on downstream clinical workflows; (3) deep learning methods face scalability limitations on high-resolution data, failing to run on 0.6 mm isotropic images, while iterative methods benefit from increased resolution; and (4) deep methods exhibit high sensitivity to preprocessing choices. These results align with the well-established literature on domain shift and suggest that claims of universal zero-shot superiority require careful scrutiny. We advocate for evaluation protocols that reflect practical clinical and research workflows rather than conditions that may inadvertently favor particular method classes.
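The effect sizes quoted above are Cohen's d values. As a minimal sketch of how such a value can be computed, the snippet below uses the pooled-standard-deviation form for two independent samples. The abstract does not state which variant the authors used (a paired design would call for a different formula), so this choice is an assumption, as are the function name `cohens_d` and the per-subject scores, which are made up purely for illustration.

```python
import math
from statistics import mean, variance


def cohens_d(a, b):
    """Cohen's d for two independent samples, using the pooled standard deviation.

    Assumption: pooled-SD variant; the source does not say which form was used.
    """
    na, nb = len(a), len(b)
    if na < 2 or nb < 2:
        raise ValueError("each sample needs at least two values")
    # statistics.variance is the sample variance (n - 1 denominator).
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    if pooled_var == 0:
        raise ValueError("pooled variance is zero; effect size is undefined")
    return (mean(a) - mean(b)) / math.sqrt(pooled_var)


if __name__ == "__main__":
    # Hypothetical per-subject overlap scores, NOT data from the paper.
    iterative_scores = [0.78, 0.81, 0.76, 0.80, 0.79]
    deep_scores = [0.74, 0.77, 0.75, 0.72, 0.78]
    print(f"Cohen's d = {cohens_d(iterative_scores, deep_scores):.2f}")
```

By the usual convention, d near 0.5 is read as a medium effect and d at or above 0.8 as a large one, which is why a range of 0.7 to 1.5 is described as practically substantial.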