AI Summary
Pretrained unconditional diffusion models are difficult to adapt efficiently to conditional generation tasks: existing approaches suffer from hyperparameter sensitivity, expensive training, or dependence on model weights hidden behind closed APIs. This paper proposes a unified framework for conditional training and sampling grounded in the mathematically well-understood Doob's h-transform, which brings many existing heuristically motivated methods under a common umbrella. Within this framework, the authors propose DEFT (Doob's h-transform Efficient FineTuning): a very small network is fine-tuned to quickly learn the conditional h-transform, while the large unconditional backbone remains entirely frozen. DEFT is much faster than existing baselines while achieving state-of-the-art performance across a variety of linear and non-linear benchmarks. On image reconstruction, it achieves speedups of up to 1.6× while attaining the best perceptual quality on natural images and reconstruction performance on medical images. Initial experiments also extend DEFT to protein motif scaffolding, where it outperforms reconstruction-guidance baselines.
Abstract
Generative modelling paradigms based on denoising diffusion processes have emerged as a leading candidate for conditional sampling in inverse problems. In many real-world applications, we often have access to large, expensively trained unconditional diffusion models, which we aim to exploit for improving conditional sampling. Most recent approaches are motivated heuristically and lack a unifying framework, obscuring connections between them. Further, they often suffer from issues such as being very sensitive to hyperparameters, being expensive to train, or needing access to weights hidden behind a closed API. In this work, we unify conditional training and sampling using the mathematically well-understood Doob's h-transform. This new perspective allows us to unify many existing methods under a common umbrella. Under this framework, we propose DEFT (Doob's h-transform Efficient FineTuning), a new approach for conditional generation that simply fine-tunes a very small network to quickly learn the conditional $h$-transform, while keeping the larger unconditional network unchanged. DEFT is much faster than existing baselines while achieving state-of-the-art performance across a variety of linear and non-linear benchmarks. On image reconstruction tasks, we achieve speedups of up to 1.6$\times$, while having the best perceptual quality on natural images and reconstruction performance on medical images. We also provide initial experiments on protein motif scaffolding and outperform reconstruction guidance methods.
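The core idea in the abstract, that Doob's h-transform decomposes the conditional score into the frozen unconditional score plus a learned correction, can be sketched in a few lines. This is a hypothetical illustration, not the paper's implementation: `uncond_net` stands for the pretrained score network and `h_net` for the small network DEFT-style fine-tuning would train; the tiny linear networks below are placeholders.

```python
import torch
import torch.nn as nn

class DEFTScore(nn.Module):
    """Sketch of the h-transform decomposition:
    grad log p(x_t | y) = grad log p(x_t) + grad log h(x_t, y).
    Only the small h-network is trainable; the backbone stays frozen.
    """

    def __init__(self, uncond_net: nn.Module, h_net: nn.Module):
        super().__init__()
        self.uncond_net = uncond_net
        for p in self.uncond_net.parameters():
            p.requires_grad_(False)  # large unconditional backbone: frozen
        self.h_net = h_net           # small network: the only trained parameters

    def forward(self, x_t: torch.Tensor, t: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
        # Conditional score = unconditional score + learned h-correction.
        return self.uncond_net(x_t, t) + self.h_net(x_t, t, y)

# Placeholder networks, purely illustrative stand-ins for real score models.
class ToyUncond(nn.Module):
    def __init__(self, d: int = 4):
        super().__init__()
        self.lin = nn.Linear(d, d)

    def forward(self, x, t):
        return self.lin(x)

class ToyH(nn.Module):
    def __init__(self, d: int = 4, dy: int = 1):
        super().__init__()
        self.lin = nn.Linear(d + dy, d)

    def forward(self, x, t, y):
        return self.lin(torch.cat([x, y], dim=-1))

model = DEFTScore(ToyUncond(), ToyH())
x = torch.randn(2, 4)
y = torch.randn(2, 1)
score = model(x, torch.tensor(0.5), y)  # shape (2, 4)
```

Because the backbone's parameters have `requires_grad` disabled, an optimizer built from `model.h_net.parameters()` updates only the lightweight correction, which is what makes this style of fine-tuning cheap relative to retraining the full model.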