🤖 AI Summary
This work addresses the challenge of effectively combining multiple pre-trained diffusion models for controllable generation when the explicit form of the target distribution is unknown. To this end, the authors formulate multi-model composition as a cooperative stochastic optimal control problem, in which multiple agents jointly optimize a shared objective to collaboratively steer the generative trajectories of the individual models. Departing from conventional approaches that rely on algebraic operations over probability densities, the proposed framework leverages multi-agent collaboration within an optimal control paradigm, thereby eliminating the need for an explicit representation of the target distribution. Experimental results on conditional MNIST generation demonstrate that the method significantly outperforms baseline approaches such as gradient-based Diffusion Posterior Sampling (DPS).
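The cooperative-control idea can be sketched in a one-dimensional toy setting. Everything below is an illustrative assumption, not the paper's implementation: the "pretrained models" are analytic Gaussian scores, each agent adds a constant control to its drift, and the controls are jointly optimized by finite-difference gradient descent on a shared terminal-plus-effort cost over the aggregated output.

```python
import numpy as np

# Toy stand-ins for pretrained diffusion scores: each agent's model pulls
# its trajectory toward its own mode mu_i (score of N(mu_i, 1) is mu_i - x).
MUS = [-2.0, 2.0]            # modes of the two "pretrained" models (assumed)
N_STEPS = 50
DT = 1.0 / N_STEPS
TARGET = 1.0                 # shared objective: aggregated output near TARGET
LAM = 0.1                    # control-effort penalty weight

def score(x, mu):
    return mu - x

def rollout(controls, rng_seed=0):
    """Simulate each agent's controlled dynamics
        dx_i = [score_i(x_i) + u_i] dt + noise,
    with constant controls u_i, and return (terminal states, shared cost)."""
    rng = np.random.default_rng(rng_seed)   # common random numbers keep
    xs = np.zeros(len(MUS))                 # finite differences stable
    effort = 0.0
    for _ in range(N_STEPS):
        noise = rng.standard_normal(len(MUS))
        for i, mu in enumerate(MUS):
            xs[i] += (score(xs[i], mu) + controls[i]) * DT \
                     + np.sqrt(DT) * 0.1 * noise[i]
        effort += LAM * np.sum(np.square(controls)) * DT
    agg = xs.mean()                         # aggregated output of all agents
    cost = (agg - TARGET) ** 2 + effort     # shared terminal + running cost
    return xs, cost

def optimize_controls(steps=200, lr=0.5, eps=1e-3):
    """Jointly descend the shared cost over ALL agents' controls via
    finite differences -- the 'cooperative' part of the formulation."""
    u = np.zeros(len(MUS))
    for _ in range(steps):
        base = rollout(u)[1]
        grad = np.zeros_like(u)
        for i in range(len(u)):
            up = u.copy()
            up[i] += eps
            grad[i] = (rollout(up)[1] - base) / eps
        u -= lr * grad
    return u
```

With no control, the two agents drift to their own modes and the aggregate sits near 0, far from the target; the jointly optimized controls trade control effort against moving the aggregate toward it.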
📝 Abstract
Continuous-time generative models have achieved remarkable success in image restoration and synthesis. However, controlling the composition of multiple pre-trained models remains an open challenge. Current approaches largely treat composition as an algebraic combination of probability densities, such as products or mixtures of experts. This perspective assumes the target distribution is known explicitly, which is almost never the case. In this work, we propose a different paradigm that formulates compositional generation as a cooperative Stochastic Optimal Control problem. Rather than combining probability densities, we treat pre-trained diffusion models as interacting agents whose diffusion trajectories are jointly steered, via optimal control, toward a shared objective defined on their aggregated output. We validate our framework on conditional MNIST generation and compare it against a naive inference-time DPS-style baseline that replaces learned cooperative control with per-step gradient guidance.
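The DPS-style baseline mentioned above can be illustrated with a minimal toy sketch. Everything here is an assumption for illustration, not the paper's baseline implementation: the pretrained score is the analytic score of a standard Gaussian, and the per-step guidance term is the gradient of a Gaussian likelihood log p(y | x) added to the unconditional score at every step, with no learned control.

```python
import numpy as np

def score(x):
    # Toy stand-in for a pretrained unconditional score: data ~ N(0, 1)
    return -x

def dps_style_sample(y, guidance_scale=1.0, n_steps=100, seed=0):
    """Naive per-step gradient guidance: at every denoising step, add
    grad_x log p(y | x) to the unconditional score instead of learning
    a control. Here p(y | x) = N(y; x, 1), so the gradient is (y - x)."""
    rng = np.random.default_rng(seed)
    dt = 1.0 / n_steps
    x = rng.standard_normal()                  # start from the prior
    for _ in range(n_steps):
        guidance = guidance_scale * (y - x)    # grad_x log N(y; x, 1)
        x += (score(x) + guidance) * dt \
             + np.sqrt(dt) * 0.1 * rng.standard_normal()
    return x
```

Setting `guidance_scale=0` recovers unconditional sampling, so the effect of the per-step guidance can be checked directly by comparing how close the two samples land to the condition `y`.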