🤖 AI Summary
Existing controllable diffusion generation methods lack a unified theoretical framework and often rely on ad hoc heuristics. This work formulates reverse diffusion sampling as a state-only stochastic control problem and establishes the first control-theoretic unified framework for diffusion guidance based on Linearly Solvable Markov Decision Processes (LS-MDPs). The framework balances target guidance against f-divergence regularization by reweighting the pretrained transition kernel, revealing that the optimal score function decomposes into a fixed baseline plus a lightweight control correction term. It further introduces a reward-weighted regression objective whose minimizer coincides with that of the original objective under the KL divergence. Combined with f-divergence-regularized policy gradients (including PPO-style updates), side-network parameterization, and gray-box fine-tuning with a frozen backbone, the approach significantly improves preference-alignment win rates and quality-efficiency trade-offs on Stable Diffusion v1.4, outperforming gray-box baselines and even white-box efficient adapters such as LoRA.
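To make the reward-weighted regression idea concrete, here is a minimal NumPy sketch of one such objective. The exponentiated-reward weighting and the function name `reward_weighted_regression_step` are illustrative assumptions; the paper derives the exact weighting from the chosen regularizer.

```python
import numpy as np

def reward_weighted_regression_step(preds, targets, rewards, beta=1.0):
    """Illustrative reward-weighted regression loss (hypothetical form).

    Higher-reward samples contribute more to the squared-error objective
    via exponentiated-reward weights, so regression is pulled toward
    high-reward behavior. The paper's actual weighting is determined by
    the f-divergence regularizer; exp(r / beta) is a common KL-style choice.
    """
    weights = np.exp(rewards / beta)
    weights = weights / weights.sum()               # normalize to a distribution
    per_sample_mse = ((preds - targets) ** 2).mean(axis=1)
    return float((weights * per_sample_mse).sum())  # weighted regression loss
```

For example, a sample with a high reward but a large regression error dominates the loss, while low-reward samples are progressively down-weighted as `beta` shrinks.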
📝 Abstract
Controllable diffusion generation often relies on a collection of seemingly disconnected heuristics that lack a unified understanding. We bridge this gap with Diffusion Controller (DiffCon), a unified control-theoretic view that casts reverse diffusion sampling as state-only stochastic control within (generalized) linearly-solvable Markov Decision Processes (LS-MDPs). Under this framework, control acts by reweighting the pretrained reverse-time transition kernels, balancing terminal objectives against an $f$-divergence cost. From the resulting optimality conditions, we derive practical reinforcement learning methods for diffusion fine-tuning: (i) $f$-divergence-regularized policy-gradient updates, including a PPO-style rule, and (ii) a regularizer-determined reward-weighted regression objective with a minimizer-preservation guarantee under the Kullback-Leibler (KL) divergence. The LS-MDP framework further implies a principled model form: the optimal score decomposes into a fixed pretrained baseline plus a lightweight control correction. This motivates a side-network parameterization, conditioned on exposed intermediate denoising outputs, that enables effective gray-box adaptation with a frozen backbone. Experiments on Stable Diffusion v1.4, covering both supervised and reward-driven fine-tuning, show consistent gains in preference-alignment win rates and improved quality-efficiency trade-offs over gray-box baselines and even the parameter-efficient white-box adapter LoRA.
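The "fixed baseline plus lightweight correction" decomposition can be sketched as follows. This is a toy NumPy stand-in, not the paper's architecture: the linear maps `W_base` and `W_side` and the function names are hypothetical, with the frozen baseline's own output standing in for the exposed intermediate denoising signal.

```python
import numpy as np

rng = np.random.default_rng(0)

# Frozen pretrained score network (stand-in: a fixed linear map, never updated).
W_base = rng.standard_normal((4, 4))

def pretrained_score(x_t):
    return x_t @ W_base.T  # fixed baseline score

# Lightweight trainable side network producing the control correction.
# It is conditioned on the state and an exposed intermediate output of the
# frozen backbone (here, the baseline score itself as a toy proxy).
# Zero initialization means training starts exactly at the pretrained policy.
W_side = np.zeros((4, 8))

def control_correction(x_t, baseline):
    h = np.concatenate([x_t, baseline], axis=-1)
    return h @ W_side.T

def guided_score(x_t):
    base = pretrained_score(x_t)
    return base + control_correction(x_t, base)  # baseline + correction
```

Only `W_side` would receive gradient updates during fine-tuning, which is what makes the adaptation gray-box: the backbone stays frozen and only its exposed intermediate outputs are consumed.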