A Scalable Distributed Framework for Multimodal GigaVoxel Image Registration

📅 2025-09-29
🤖 AI Summary
Image registration algorithms scale poorly to gigavoxel multimodal medical images (e.g., full-resolution human brain MRI), limited mainly by GPU memory capacity and by inefficient non-GEMM (non-matrix-multiplication) operations. To address this, we propose FFDP, an I/O-aware distributed framework for large-scale medical image registration. It complements existing model parallelism techniques with convolution-aware tensor sharding and custom fused non-GEMM kernels, easing the memory and compute limits of conventional registration pipelines. On a system with 8 A6000 GPUs, FFDP performs multimodal registration of a 100-μm ex-vivo human brain MRI volume at native resolution in about a minute. It accelerates state-of-the-art optimization and deep learning registration pipelines by up to 6–7× while reducing peak GPU memory consumption by 20%–59%. On a 250-μm dataset, FFDP fits problems up to 64× larger than existing state-of-the-art methods on a single GPU.

📝 Abstract
In this work, we propose FFDP, a set of IO-aware non-GEMM fused kernels supplemented with a distributed framework for image registration at unprecedented scales. Image registration is an inverse problem fundamental to biomedical and life sciences, but algorithms have not scaled in tandem with image acquisition capabilities. Our framework complements existing model parallelism techniques proposed for large-scale transformer training by optimizing non-GEMM bottlenecks and enabling convolution-aware tensor sharding. We demonstrate unprecedented capabilities by performing multimodal registration of a 100-micron ex-vivo human brain MRI volume at native resolution (an inverse problem more than 570x larger than a standard clinical datum) in about a minute using only 8 A6000 GPUs. FFDP accelerates existing state-of-the-art optimization and deep learning registration pipelines by up to 6-7x while reducing peak memory consumption by 20-59%. Comparative analysis on a 250-micron dataset shows that FFDP can fit up to 64x larger problems than existing SOTA on a single GPU, and highlights both the performance and efficiency gains of FFDP compared to SOTA image registration methods.

Problem

Research questions and friction points this paper is trying to address.

Scaling image registration to unprecedented multimodal gigavoxel scales
Optimizing non-GEMM bottlenecks in distributed biomedical image processing
Reducing memory consumption while accelerating registration pipelines
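
To make the non-GEMM memory bottleneck concrete, here is a minimal NumPy sketch of why fusing an elementwise operation with its reduction lowers peak memory. It is only an illustration: mean squared error is a stand-in similarity measure, not necessarily one FFDP implements, and the function names `mse_unfused` and `mse_chunked` are hypothetical. Chunking on the CPU merely imitates the memory behaviour of a fused GPU kernel, which would keep intermediates in registers or on-chip memory.

```python
import numpy as np

def mse_unfused(fixed, moving):
    """Mean squared error computed step by step.

    Each step materializes a temporary as large as the input volume,
    which is the pattern that inflates peak memory.
    """
    diff = fixed - moving          # full-size temporary
    sq = diff * diff               # second full-size temporary
    return sq.sum() / sq.size

def mse_chunked(fixed, moving, chunk=4):
    """Same result, accumulated slab by slab along axis 0.

    Temporaries are only `chunk` slices thick, imitating a fused
    kernel that never writes the full difference volume to memory.
    """
    total = 0.0
    for start in range(0, fixed.shape[0], chunk):
        d = fixed[start:start + chunk] - moving[start:start + chunk]
        total += float(np.sum(d * d))
    return total / fixed.size

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    fixed = rng.random((10, 5, 5))
    moving = rng.random((10, 5, 5))
    print(np.isclose(mse_unfused(fixed, moving), mse_chunked(fixed, moving)))
```

Both functions return the same value up to floating-point rounding. The difference is only in how much temporary storage exists at any one time.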

Innovation

Methods, ideas, or system contributions that make the work stand out.

IO-aware non-GEMM fused kernels for optimization
Distributed framework enabling convolution-aware tensor sharding
Accelerates registration pipelines while reducing memory consumption
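
The summary does not spell out how convolution-aware tensor sharding works. One common reading, assumed here, is that a volume is split across devices along one axis and each shard carries a small "halo" of neighbouring slices, so a convolution on the shard gives the same answer as on the full volume. The sketch below illustrates that idea on the CPU with NumPy. The helper names `smooth_axis0` and `sharded_smooth_axis0` are hypothetical, the kernel is assumed to have odd length, and nothing here reflects FFDP's actual implementation or communication scheme.

```python
import numpy as np

def smooth_axis0(vol, kernel):
    """Zero-padded 'same' correlation of a 3D volume along axis 0."""
    r = len(kernel) // 2
    padded = np.pad(vol, ((r, r), (0, 0), (0, 0)))
    out = np.zeros(vol.shape, dtype=np.float64)
    for i, w in enumerate(kernel):
        out += w * padded[i:i + vol.shape[0]]
    return out

def sharded_smooth_axis0(vol, kernel, num_shards):
    """Same filter, applied shard by shard with halo slices.

    Each shard is extended by r = len(kernel) // 2 slices on each side
    (where they exist), filtered locally, and the halo outputs are
    discarded before the pieces are stitched back together.
    """
    r = len(kernel) // 2
    n = vol.shape[0]
    bounds = np.linspace(0, n, num_shards + 1, dtype=int)
    pieces = []
    for start, stop in zip(bounds[:-1], bounds[1:]):
        lo = max(start - r, 0)       # left halo, clipped at the volume edge
        hi = min(stop + r, n)        # right halo, clipped at the volume edge
        local_out = smooth_axis0(vol[lo:hi], kernel)
        offset = start - lo          # number of halo slices to drop
        pieces.append(local_out[offset:offset + (stop - start)])
    return np.concatenate(pieces, axis=0)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    vol = rng.random((16, 4, 4))
    kernel = [0.25, 0.5, 0.25]
    full = smooth_axis0(vol, kernel)
    sharded = sharded_smooth_axis0(vol, kernel, num_shards=4)
    print(np.allclose(full, sharded))
```

In a real multi-GPU setting the halo slices would be exchanged between neighbouring devices rather than sliced from a shared array, and the halo width grows with the kernel size, which is why the sharding has to be aware of the convolution.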