🤖 AI Summary
To address the weak feature extraction and suboptimal training paradigms of diffusion models in medical image segmentation, this paper proposes LEAF, a latent diffusion-based framework that abandons conventional noise prediction and instead directly regresses segmentation masks to reduce output variance. LEAF is the first to introduce, without modifying the network architecture, feature distillation that aligns the intermediate representations of the convolutional hidden layers with those of a Transformer-based visual encoder. It further adopts an efficient fine-tuning strategy with a frozen backbone. Evaluated on multi-disease, multi-modal medical segmentation benchmarks, LEAF improves on the baseline diffusion models, demonstrating both effectiveness and strong generalization. Its core innovations are (1) a task-adapted direct prediction paradigm that bypasses iterative denoising and (2) a latent-state alignment mechanism that bridges the architectural heterogeneity between convolutional and attention-based encoders.
📝 Abstract
Diffusion models have proven effective for medical image segmentation. However, existing methods typically reuse the original diffusion training process without adapting it to segmentation. Furthermore, commonly used pre-trained diffusion models remain deficient in feature extraction. To address these issues, we propose LEAF, a medical image segmentation model built on latent diffusion models. During fine-tuning, we replace the original noise-prediction objective with direct prediction of the segmentation map, thereby reducing the variance of the segmentation results. We also employ feature distillation to align the hidden states of the convolutional layers with features from a transformer-based vision encoder. Experimental results show that our method improves on the original diffusion model across multiple segmentation datasets covering different disease types. Notably, our approach neither alters the model architecture nor increases the parameter count or computation at inference, making it highly efficient.
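The two training changes described in the abstract can be illustrated with a small, dependency-free sketch. It is not the authors' implementation: the mean-squared-error form of the direct-prediction term, the cosine-based alignment term, the linear projection `weight`, the balancing factor `lam`, and every function name below are assumptions chosen for illustration, since the abstract does not specify them. A projection used only during training would be one way to stay consistent with the claim that inference cost is unchanged, but that too is an assumption.

```python
import math


def mse(pred, target):
    """Mean squared error between two equal-length flat sequences."""
    assert len(pred) == len(target)
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def cosine_similarity(u, v, eps=1e-8):
    """Cosine similarity between two vectors, guarded against zero norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / max(norm_u * norm_v, eps)


def project(feature, weight):
    """Apply a linear map; `weight` holds one row per output dimension."""
    return [sum(w * x for w, x in zip(row, feature)) for row in weight]


def alignment_loss(conv_feats, encoder_feats, weight):
    """Mean (1 - cosine similarity) over spatial locations.

    conv_feats:    per-location hidden states of the network being tuned
    encoder_feats: per-location features from a frozen vision encoder
    weight:        hypothetical projection into the encoder's feature space
    """
    assert len(conv_feats) == len(encoder_feats)
    total = 0.0
    for hidden, reference in zip(conv_feats, encoder_feats):
        total += 1.0 - cosine_similarity(project(hidden, weight), reference)
    return total / len(conv_feats)


def combined_loss(pred_mask, target_mask, conv_feats, encoder_feats,
                  weight, lam=0.5):
    """Direct mask prediction term plus a weighted alignment term.

    The first term compares the network output with the target mask
    itself, rather than with the noise that was added (the usual
    noise-prediction target). `lam` is an assumed balancing weight.
    """
    direct = mse(pred_mask, target_mask)
    align = alignment_loss(conv_feats, encoder_feats, weight)
    return direct + lam * align


if __name__ == "__main__":
    identity = [[1.0, 0.0], [0.0, 1.0]]
    loss = combined_loss(
        pred_mask=[0.9, 0.1, 0.8],
        target_mask=[1.0, 0.0, 1.0],
        conv_feats=[[1.0, 0.2], [0.1, 1.0]],
        encoder_feats=[[1.0, 0.0], [0.0, 1.0]],
        weight=identity,
    )
    print(f"combined loss: {loss:.4f}")
```

In this sketch the alignment term only shapes the hidden states during training, so dropping it (and the projection) at inference leaves the segmentation network itself untouched.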