🤖 AI Summary
This work proposes ProGiDiff, a novel framework that integrates pretrained diffusion models with natural language prompts to address key limitations of existing deterministic medical image segmentation methods, which lack support for natural language interaction, multi-proposal generation, and cross-modal transfer. By incorporating a ControlNet-style customized image encoder for conditional guidance, ProGiDiff generates multi-class organ segmentation masks while enabling interactive, expert-in-the-loop selection among multiple segmentation hypotheses. The framework further introduces low-rank fine-tuning and few-shot adaptation strategies to enable efficient cross-modal transfer from CT to MRI. Experimental results demonstrate that ProGiDiff outperforms current methods on CT segmentation tasks and generalizes effectively to the MRI domain with only a few annotated samples.
📝 Abstract
Widely adopted medical image segmentation methods, although efficient, are primarily deterministic and poorly amenable to natural language prompts. As a result, they lack support for multiple segmentation proposals, human interaction, and cross-modality adaptation. Recently, text-to-image diffusion models have shown potential to bridge this gap. However, training them from scratch requires a large dataset, which is a limitation for medical image segmentation. Furthermore, such models are often restricted to binary segmentation and cannot be conditioned on a natural language prompt. To this end, we propose a novel framework called ProGiDiff that leverages existing image generation models for medical image segmentation. Specifically, we propose a ControlNet-style conditioning mechanism with a custom encoder, suitable for image conditioning, to steer a pre-trained diffusion model to output segmentation masks. It naturally extends to a multi-class setting simply by prompting for the target organ. Our experiments on organ segmentation from CT images demonstrate strong performance compared to previous methods, and the multiple proposals can greatly benefit an expert-in-the-loop setting. Importantly, we demonstrate that the learned conditioning mechanism can be easily transferred to MR image segmentation through low-rank, few-shot adaptation.
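The low-rank, few-shot adaptation mentioned above can be illustrated with a minimal LoRA-style sketch: a frozen pretrained weight is augmented with a trainable low-rank update, so only a small number of parameters need tuning for the CT-to-MRI transfer. All names and dimensions below are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical layer sizes; the rank r is chosen much smaller than the
# layer dimensions so the adaptation adds few trainable parameters.
d_out, d_in, r = 64, 32, 4
alpha = 8.0  # LoRA scaling hyperparameter (illustrative value)

# Frozen pretrained weight, e.g. from the CT-trained conditioning encoder.
W = rng.standard_normal((d_out, d_in))

# Trainable low-rank factors: A is small random, B is zero-initialized,
# so the adapted layer starts out identical to the frozen one.
A = rng.standard_normal((r, d_in)) * 0.01
B = np.zeros((d_out, r))

def adapted_forward(x):
    """Forward pass with the low-rank update: W x + (alpha / r) * B A x."""
    return W @ x + (alpha / r) * (B @ (A @ x))

x = rng.standard_normal(d_in)
# With B zero-initialized, adaptation is a no-op until fine-tuning updates B and A.
assert np.allclose(adapted_forward(x), W @ x)
```

During few-shot fine-tuning only `A` and `B` would receive gradients, keeping `W` fixed; this is what makes the transfer cheap enough to work with a handful of annotated MR images.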