🤖 AI Summary
Medical image registration often fails in low-contrast or anatomically variable regions because conventional local intensity-based similarity metrics have limited capacity to model global semantics. To address this, we propose a zero-shot, unsupervised 3D registration framework that requires neither fine-tuning nor task-specific training. Our method leverages multi-scale intermediate activations from a pretrained medical latent diffusion model as robust voxel-wise descriptors, enabling voxel-level correspondence estimation via cosine similarity. We further enhance stability by integrating a local search prior and controlled noise injection. Despite requiring no dedicated model training, our approach significantly outperforms classical B-spline registration and matches the accuracy of UniGradICON, a state-of-the-art learned method, on public lung CT datasets. This work points to training-free diffusion features as a practical route to efficient, generalizable, and clinically deployable medical image registration.
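To make the descriptor-building step concrete, here is a minimal sketch: inject mild Gaussian noise into a volume, collect multi-scale activations from a denoising network, upsample them to full resolution, and fuse them into unit-norm voxel descriptors. The stand-in `TinyUNet3D`, the `noise_std` value, and the layer selection are illustrative assumptions; the actual method extracts activations from a pretrained medical latent diffusion model.

```python
# Hypothetical sketch (not the authors' code): fuse multi-scale activations
# from a stand-in 3D network into L2-normalized voxel-wise descriptors.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyUNet3D(nn.Module):
    """Toy stand-in for a pretrained medical latent diffusion backbone."""
    def __init__(self, ch=8):
        super().__init__()
        self.enc1 = nn.Conv3d(1, ch, 3, padding=1)
        self.enc2 = nn.Conv3d(ch, 2 * ch, 3, stride=2, padding=1)
        self.enc3 = nn.Conv3d(2 * ch, 4 * ch, 3, stride=2, padding=1)

    def forward(self, x):
        f1 = F.relu(self.enc1(x))   # full resolution
        f2 = F.relu(self.enc2(f1))  # 1/2 resolution
        f3 = F.relu(self.enc3(f2))  # 1/4 resolution
        return [f1, f2, f3]

@torch.no_grad()
def voxel_descriptors(net, volume, noise_std=0.1):
    """Inject modest noise, run the network, and fuse the multi-scale
    activations into one unit-norm descriptor per voxel."""
    noisy = volume + noise_std * torch.randn_like(volume)  # controlled noise
    feats = net(noisy)
    size = volume.shape[2:]  # (D, H, W)
    upsampled = [F.interpolate(f, size=size, mode="trilinear",
                               align_corners=False) for f in feats]
    desc = torch.cat(upsampled, dim=1)   # (1, C_total, D, H, W)
    return F.normalize(desc, dim=1)      # unit norm -> dot product = cosine

vol = torch.randn(1, 1, 32, 32, 32)     # toy CT patch
desc = voxel_descriptors(TinyUNet3D(), vol)
print(desc.shape)                        # torch.Size([1, 56, 32, 32, 32])
```

Normalizing the fused descriptors up front means the later matching step reduces to a plain dot product per candidate offset.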
📝 Abstract
Accurate spatial correspondence between medical images is essential for longitudinal analysis, lesion tracking, and image-guided interventions. Conventional medical image registration methods rely on local intensity-based similarity measures, which fail to capture global semantic structure and often yield mismatches in low-contrast or anatomically variable regions. Recent advances in diffusion models suggest that their intermediate representations encode rich geometric and semantic information. We present MedDIFT, a training-free 3D correspondence framework that leverages multi-scale features from a pretrained latent medical diffusion model as voxel descriptors. MedDIFT fuses diffusion activations into rich voxel-wise descriptors and matches them via cosine similarity, with an optional local-search prior. On a publicly available lung CT dataset, MedDIFT achieves correspondence accuracy comparable to that of the state-of-the-art learning-based UniGradICON model and surpasses conventional B-spline registration, without any task-specific model training. Ablation experiments confirm that multi-level feature fusion and modest diffusion noise injection each improve performance.
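The matching step described above can be sketched as follows: each fixed-image voxel is compared only against a small window around the same location in the moving image (the local-search prior), and the best-scoring integer offset under cosine similarity is kept. The `window` size and the shift-and-compare implementation are assumptions for illustration, not the paper's implementation.

```python
# Hedged sketch of cosine-similarity matching with a local-search prior.
import torch
import torch.nn.functional as F

@torch.no_grad()
def local_cosine_match(desc_fix, desc_mov, window=3):
    """desc_*: (C, D, H, W) unit-norm descriptors. Returns integer
    displacements (3, D, H, W) maximizing cosine similarity in the window."""
    C, D, H, W = desc_fix.shape
    r = window // 2
    # Zero-pad spatial dims; border voxels score 0 against out-of-range shifts.
    pad = F.pad(desc_mov, (r, r, r, r, r, r))
    offsets, scores = [], []
    for dz in range(-r, r + 1):
        for dy in range(-r, r + 1):
            for dx in range(-r, r + 1):
                shifted = pad[:, r + dz:r + dz + D,
                                 r + dy:r + dy + H,
                                 r + dx:r + dx + W]
                # Dot product of unit vectors equals cosine similarity;
                # shifted voxel p holds the moving descriptor at p + offset.
                scores.append((desc_fix * shifted).sum(dim=0))
                offsets.append((dz, dy, dx))
    scores = torch.stack(scores)            # (window**3, D, H, W)
    best = scores.argmax(dim=0)             # best offset index per voxel
    off = torch.tensor(offsets).T.float()   # (3, window**3)
    return off[:, best]                     # (3, D, H, W)

fix = F.normalize(torch.randn(56, 16, 16, 16), dim=0)
mov = F.normalize(torch.randn(56, 16, 16, 16), dim=0)
disp = local_cosine_match(fix, mov)
print(disp.shape)                           # torch.Size([3, 16, 16, 16])
```

Restricting the search to a local window both regularizes the correspondences and keeps the comparison cost linear in `window**3` rather than in the full volume size.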