🤖 AI Summary
This work addresses a central modelling challenge in multi-modal groupwise image registration: anatomical structure and geometric deformation are inherently coupled in the observed images. We propose a Bayesian unsupervised disentanglement learning framework. Methodologically, we design a hierarchical variational auto-encoder that explicitly separates two latent variables, the anatomical representation (a shared structural prior) and the geometric deformation (a subject-specific spatial transformation), and jointly estimates the registration parameters through closed-loop self-reconstruction in an end-to-end manner. By abandoning handcrafted similarity metrics, the approach achieves interpretable and scalable generative registration. Evaluated on four multimodal medical imaging datasets spanning cardiac, brain, and abdominal anatomy, the method shows significant improvements over conventional similarity-driven methods: higher registration accuracy, improved computational efficiency, scalability to arbitrary cohort sizes, and anatomical latent representations with visually interpretable semantics.
📝 Abstract
This article presents a general Bayesian learning framework for multi-modal groupwise image registration. The method builds on a probabilistic model of the image generative process, in which the underlying common anatomy and the geometric variations of the observed images are explicitly disentangled as latent variables, so that groupwise registration is achieved via hierarchical Bayesian inference. We propose a novel hierarchical variational auto-encoding architecture to realise the inference of these latent variables, in which the registration parameters are estimated explicitly and in a mathematically interpretable fashion. Remarkably, this new paradigm learns groupwise image registration through an unsupervised, closed-loop self-reconstruction process, sparing the burden of designing complex image-based similarity measures. The computationally efficient disentangled network architecture is also inherently scalable and flexible, allowing groupwise registration of large-scale image groups of variable size. Furthermore, the structural representations inferred from multi-modal images via disentanglement learning capture the latent anatomy of the observations with visual semantics. Extensive experiments on four datasets of cardiac, brain, and abdominal medical images validate the proposed framework. The results demonstrate the superiority of our method over conventional similarity-based approaches in terms of accuracy, efficiency, scalability, and interpretability.
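To make the closed-loop self-reconstruction idea concrete, here is a deliberately simplified sketch. It is *not* the paper's hierarchical VAE: the shared anatomy is a hand-built 1-D signal rather than a learned latent, the geometric deformation is reduced to a circular shift per subject, and inference is a plain alternating least-squares loop instead of variational learning. What it does share with the framework is the structure of the loop: the group is aligned by jointly re-estimating a common structural representation and subject-specific deformations so that each observation is reconstructed from the shared anatomy, with no pairwise similarity metric between the observed images themselves.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "common anatomy": a 1-D Gaussian bump shared by the whole group.
# (Assumption: a stand-in for the anatomical latent the paper infers with a VAE.)
n = 64
anatomy = np.exp(-0.5 * ((np.arange(n) - 20) / 4.0) ** 2)

# Each "subject" observes the anatomy under its own geometric deformation,
# modelled here as a circular shift, plus observation noise.
true_shifts = [0, 5, 11, 17]
images = [np.roll(anatomy, s) + 0.02 * rng.standard_normal(n) for s in true_shifts]

def groupwise_register(images, n_iters=5):
    """Alternate between (a) re-estimating the shared anatomy from the
    inverse-warped images and (b) re-estimating each subject's deformation by
    minimising its self-reconstruction error."""
    shifts = [0] * len(images)
    for _ in range(n_iters):
        # (a) shared structural estimate given the current alignments
        template = np.mean([np.roll(y, -s) for y, s in zip(images, shifts)], axis=0)
        # (b) subject-specific deformation by exhaustive search over shifts
        shifts = [
            int(min(range(n), key=lambda s: np.sum((y - np.roll(template, s)) ** 2)))
            for y in images
        ]
    return template, shifts

template, shifts = groupwise_register(images)
# Shifts are recovered only up to a common global offset (the usual gauge
# freedom of groupwise registration), so compare pairwise differences:
print([(s - shifts[0]) % n for s in shifts])  # -> [0, 5, 11, 17]
```

In the actual method, step (a) corresponds to inferring the anatomical latent and step (b) to inferring the deformation latent; replacing the exhaustive search with amortised variational inference is what makes the approach scale to large cohorts and dense deformations.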