🤖 AI Summary
Poor generalizability of existing 3D biomedical foundation models stems primarily from the limited scale and insufficient anatomical, modality, and protocol coverage of public datasets. To address this, we propose the first general volumetric representation learning paradigm that requires no real 3D medical images: it stochastically synthesizes highly diverse virtual data to explicitly model domain shifts during training, and it couples this data engine with contrastive, domain-invariant feature pretraining to build a backbone resilient to imaging artifacts and acquisition variation. Our method achieves state-of-the-art performance on both cross-modality image registration and few-shot organ segmentation, two clinically critical yet data-hungry tasks, while entirely eliminating dependence on real medical data. This marks the first dual-task co-advancement enabled by synthetic-data-driven foundation modeling, establishing a scalable, reproducible, and resource-efficient paradigm for low-data biomedical AI.
📝 Abstract
Current volumetric biomedical foundation models struggle to generalize because public 3D datasets are small and do not cover the broad diversity of medical procedures, conditions, anatomical regions, and imaging protocols. We address this by creating a representation learning method that instead anticipates strong domain shifts at training time. We first propose a data engine that synthesizes highly variable training samples to enable generalization to new biomedical contexts. To then train a single 3D network for any voxel-level task, we develop a contrastive learning method that pretrains the network to be stable against nuisance imaging variation simulated by the data engine, a key inductive bias for generalization. This network's features can be used as robust representations of input images for downstream tasks, and its weights provide a strong, dataset-agnostic initialization for finetuning on new datasets. As a result, we set new standards across both multimodality registration and few-shot segmentation, a first for any 3D biomedical vision model, all without (pre-)training on any existing dataset of real images.
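To make the contrastive objective concrete, the sketch below illustrates the core idea in minimal NumPy: features of the same voxel under two simulated appearances are treated as positives in an InfoNCE loss, so an encoder invariant to the simulated nuisance variation scores a lower loss than a view-dependent one. This is a toy illustration under our own assumptions (the `info_nce` helper, dimensions, and noise levels are made up), not the paper's actual data engine or network.

```python
import numpy as np

def info_nce(z1, z2, tau=0.1):
    """InfoNCE loss: pairs (z1[i], z2[i]) are features of the same voxel
    under two simulated appearances (positives); all other pairs in the
    batch act as negatives."""
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    logits = z1 @ z2.T / tau                     # (N, N) cosine similarities
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_p = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_p))              # cross-entropy on the diagonal

rng = np.random.default_rng(0)
n_vox, dim = 256, 16

# An invariant encoder maps both synthetic renderings of a voxel to
# (nearly) the same feature, so the contrastive loss is low.
base = rng.normal(size=(n_vox, dim))
loss_invariant = info_nce(base + 0.05 * rng.normal(size=base.shape),
                          base + 0.05 * rng.normal(size=base.shape))

# A view-dependent encoder yields unrelated features per rendering,
# so the same loss is high; minimizing it therefore enforces invariance.
loss_viewdep = info_nce(rng.normal(size=(n_vox, dim)),
                        rng.normal(size=(n_vox, dim)))

print(loss_invariant, loss_viewdep)
```

In the paper's setting the two "renderings" come from the synthetic data engine (random intensities, artifacts, acquisition parameters applied to the same underlying anatomy), which is what bakes the stability-to-nuisance-variation inductive bias into the pretrained backbone.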