🤖 AI Summary
How can multiple pre-trained diffusion models be fused at generation time, without retraining and without significant extra computational overhead? This paper proposes SuperDiff, a framework that rigorously derives a superposition principle for combining diffusion-model vector fields from the continuity equation, enabling training-free model ensembling during generation. The method centers on a scalable Itô density estimator for the log-likelihood of the diffusion SDE, which costs no more than the Hutchinson's trace estimator already needed for divergence computations; the resulting per-model density estimates drive an automatic re-weighting of the combined models that naturally encodes logical OR and logical AND composition. Experiments demonstrate that SuperDiff improves image diversity on CIFAR-10, yields more faithful prompt-conditioned image editing with Stable Diffusion, and improves conditional molecule generation and unconditional de novo protein structure design.
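The divergence computation mentioned above is the standard Hutchinson trick: the divergence of a vector field is the trace of its Jacobian, and that trace can be estimated from Jacobian-vector products with random probe vectors, without ever forming the Jacobian. The sketch below is illustrative only (the function name and the finite-difference JVP are our own choices, not the paper's implementation, which would use autodiff JVPs):

```python
import numpy as np

def hutchinson_divergence(f, x, n_samples=1000, eps=1e-4, seed=None):
    """Estimate div f(x) = tr(J_f(x)) via Hutchinson's estimator:
    E_v[v^T J v] over Rademacher probes v. The Jacobian-vector product
    J v is approximated by a finite difference, so no autodiff is needed
    (a real implementation would use an exact JVP instead)."""
    rng = np.random.default_rng(seed)
    f0 = f(x)
    total = 0.0
    for _ in range(n_samples):
        v = rng.choice([-1.0, 1.0], size=x.shape)  # Rademacher probe
        jvp = (f(x + eps * v) - f0) / eps          # approximate J @ v
        total += v @ jvp                           # v^T J v
    return total / n_samples

# Sanity check on a linear field f(x) = A x, whose divergence is tr(A):
A = np.array([[2.0, 0.5], [0.3, -1.0]])
f = lambda x: A @ x
x = np.array([0.7, -0.2])
est = hutchinson_divergence(f, x, n_samples=2000, seed=0)
# the estimate converges to tr(A) = 1.0 as n_samples grows
```

The paper's contribution on top of this machinery is an Itô density estimator that tracks the log-likelihood along the diffusion SDE at no extra cost beyond these divergence probes.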
📝 Abstract
The Cambrian explosion of easily accessible pre-trained diffusion models suggests a demand for methods that combine multiple different pre-trained diffusion models without incurring the significant computational burden of re-training a larger combined model. In this paper, we cast the problem of combining multiple pre-trained diffusion models at the generation stage under a novel proposed framework termed superposition. Theoretically, we derive superposition from rigorous first principles stemming from the celebrated continuity equation and design two novel algorithms tailor-made for combining diffusion models in SuperDiff. SuperDiff leverages a new scalable Itô density estimator for the log-likelihood of the diffusion SDE, which incurs no additional overhead compared to the well-known Hutchinson's estimator needed for divergence calculations. We demonstrate that SuperDiff is scalable to large pre-trained diffusion models as superposition is performed solely through composition during inference, and also enjoys painless implementation as it combines different pre-trained vector fields through an automated re-weighting scheme. Notably, we show that SuperDiff is efficient during inference time, and mimics traditional composition operators such as the logical OR and the logical AND. We empirically demonstrate the utility of using SuperDiff for generating more diverse images on CIFAR-10, more faithful prompt conditioned image editing using Stable Diffusion, as well as improved conditional molecule generation and unconditional de novo structure design of proteins. Code: https://github.com/necludov/super-diffusion
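The "automated re-weighting scheme" above can be pictured as follows: at each inference step, each model's velocity field is weighted by how likely that model finds the current sample, and the weighted fields are summed. The minimal sketch below uses a softmax over estimated log-likelihoods as an OR-style weighting; the function name, the softmax form, and the temperature `kappa` are illustrative assumptions, not the paper's exact derivation from the continuity equation:

```python
import numpy as np

def superpose_velocities(velocities, logps, kappa=1.0):
    """Mix per-model velocity fields with density-based weights.
    OR-style sketch: models assigning higher estimated log-likelihood
    to the current sample get more weight (softmax re-weighting with
    sharpness kappa, a hypothetical choice for illustration)."""
    logits = kappa * np.asarray(logps, dtype=float)
    w = np.exp(logits - logits.max())  # numerically stable softmax
    w /= w.sum()
    v = sum(wi * vi for wi, vi in zip(w, velocities))
    return v, w

# Toy example: two hypothetical model velocities at one point.
v1 = np.array([1.0, 0.0])
v2 = np.array([0.0, 1.0])
v_eq, w_eq = superpose_velocities([v1, v2], logps=[-2.0, -2.0])
# equal densities -> equal weights -> the average of the two fields
v_or, w_or = superpose_velocities([v1, v2], logps=[-1.0, -3.0])
# the model with higher density dominates the mixture
```

An AND-style composition would instead steer toward regions where all models assign high density simultaneously; in either mode, the weights come for free from the Itô density estimator run alongside generation.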