🤖 AI Summary
This work addresses how to steer diffusion models toward high-reward samples without updating model weights, while mitigating weight degeneracy and high-variance estimates in particle-based methods. The authors propose Trust-Region Iterative Twisted Sequential Monte Carlo (TRI-TSMC). At inference time, it learns look-ahead twisting functions that steer sampling trajectories toward a reward-tilted target distribution. A trust-region mechanism stabilizes twisting-function learning in high-dimensional state spaces with terminal, noisy, or black-box rewards. Each iteration combines two steps. First, a KL-constrained update in path space is solved in closed form by tempered importance reweighting. Second, a weighted maximum-likelihood projection maps the result back onto the parameterized twisted family. Theoretically, the authors show that the optimal twisting function has a value-function interpretation and yields a zero-variance sampler. They also show that the trust-region update follows an escort path toward the target distribution and reduces residual importance-weight variance. Under matched inference-time budgets, the method improves primary alignment objectives on discrete diffusion text generation and text-to-image generation.
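The zero-variance claim can be checked on a toy problem. The sketch below makes several assumptions that are not from the paper: a two-step discrete chain stands in for the diffusion sampler, the reward depends only on the terminal state, and the twist is taken to be the exponentiated soft value ψ(x1) = E_p[exp(r(x2)) | x1]. All names and numbers are hypothetical. Under these assumptions, proposing from the twisted kernels makes every path's importance weight equal the normalizer Z. The untwisted base proposal instead gives weights exp(r(x2)), which vary across paths.

```python
import math

# Hypothetical toy base sampler: two discrete steps, states in {0, 1}.
P1 = {0: 0.7, 1: 0.3}                      # p(x1)
P2 = {0: {0: 0.9, 1: 0.1},                 # p(x2 | x1 = 0)
      1: {0: 0.2, 1: 0.8}}                 # p(x2 | x1 = 1)
REWARD = {0: 0.0, 1: 2.0}                  # terminal reward r(x2)


def twist(x1):
    """Assumed optimal twist psi(x1) = E_p[exp(r(x2)) | x1], the exponentiated soft value."""
    return sum(P2[x1][x2] * math.exp(REWARD[x2]) for x2 in (0, 1))


# Normalizer of the reward-tilted target p(x1) p(x2 | x1) exp(r(x2)).
Z = sum(P1[x1] * twist(x1) for x1 in (0, 1))


def q1(x1):
    """Twisted first-step proposal q(x1) proportional to p(x1) psi(x1)."""
    return P1[x1] * twist(x1) / Z


def q2(x2, x1):
    """Twisted second-step proposal q(x2 | x1) proportional to p(x2 | x1) exp(r(x2))."""
    return P2[x1][x2] * math.exp(REWARD[x2]) / twist(x1)


def path_weight(x1, x2, twisted):
    """Importance weight of the unnormalized target relative to the chosen proposal."""
    target = P1[x1] * P2[x1][x2] * math.exp(REWARD[x2])
    proposal = q1(x1) * q2(x2, x1) if twisted else P1[x1] * P2[x1][x2]
    return target / proposal


paths = [(a, b) for a in (0, 1) for b in (0, 1)]
base_weights = [path_weight(a, b, twisted=False) for a, b in paths]
twisted_weights = [path_weight(a, b, twisted=True) for a, b in paths]
print("base proposal weights:   ", [round(w, 3) for w in base_weights])
print("twisted proposal weights:", [round(w, 3) for w in twisted_weights], "Z =", round(Z, 3))
```

The toy makes visible why a learned twist approximating this value function can cut weight variance. In a real diffusion model ψ is intractable, which is why it has to be learned iteratively.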
📝 Abstract
We study inference-time alignment for diffusion-based generative models, aiming to steer a base model toward high-reward outputs without updating its weights. Recent Sequential Monte Carlo (SMC)-based steering methods approximate reward-tilted target distributions in a principled way, but their proposals remain largely tied to the base sampler. Because reward information enters mainly after propagation, through particle reweighting and resampling, these methods can require large particle budgets and suffer from weight degeneracy and high-variance estimates. One way to reduce variance and improve particle efficiency is to iteratively learn twisting functions that provide look-ahead guidance, as in twisted SMC. However, existing learnable twisting methods were developed mainly for classical sequential inference. They can be unstable when applied to diffusion-based alignment, where state spaces are high-dimensional and rewards are terminal, noisy, or black-box. We propose Trust-Region Iterative Twisted Sequential Monte Carlo (TRI-TSMC), a trust-region framework for learning twisting functions in SMC-based inference-time alignment. Each iteration computes an exact KL-constrained update in path space, which admits a closed-form solution via tempered importance reweighting. It then projects this updated target back onto the parameterized twisted family by weighted maximum likelihood. Theoretically, we formalize the value-function interpretation of the optimal twisting function and show that it yields a zero-variance sampler. We prove three further properties: the trust-region update follows an escort path toward the target distribution, the weighted maximum-likelihood update is a forward-KL projection, and following this path reduces residual importance-weight variance. Empirically, TRI-TSMC improves primary alignment objectives on discrete diffusion text generation and text-to-image generation under matched inference-time budgets.
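One trust-region iteration (exact KL-constrained update, then weighted maximum-likelihood projection) can be sketched as follows. This is a minimal illustration, not the authors' implementation, and it makes several assumptions:

* The parameterized twisted family is replaced by a 1-D Gaussian proposal q = N(mu, var).
* The target is a hypothetical reward-tilted density, a N(0, 1) base times exp(r(x)) with r(x) = -(x - 3)^2 / 2.
* The KL budget `eps` is enforced by bisection on a tempering exponent β.

Reweighting samples from q by w^β, with w = π/q, realizes the escort distribution q^(1-β) π^β. The sketch estimates KL(π_β || q) from the self-normalized tempered weights ŵ as Σ ŵ_i log(N ŵ_i). The weighted mean and variance are then the weighted maximum-likelihood (forward-KL) fit of the Gaussian family. The names `trust_region_step`, `choose_beta`, and `eps` are illustrative.

```python
import math
import random


def log_normal_pdf(x, mu, var):
    return -0.5 * math.log(2 * math.pi * var) - 0.5 * (x - mu) ** 2 / var


def log_target(x):
    """Hypothetical unnormalized tilted target: N(0, 1) base times exp(r(x)), r(x) = -(x - 3)^2 / 2."""
    return log_normal_pdf(x, 0.0, 1.0) - 0.5 * (x - 3.0) ** 2


def tempered_weights(log_w, beta):
    """Self-normalized tempered importance weights w_i^beta / sum_j w_j^beta (log-sum-exp stabilized)."""
    scaled = [beta * lw for lw in log_w]
    m = max(scaled)
    unnorm = [math.exp(s - m) for s in scaled]
    total = sum(unnorm)
    return [u / total for u in unnorm]


def kl_to_proposal(weights):
    """Sample estimate of KL(pi_beta || q) = sum_i w_i log(N w_i); zero when weights are uniform."""
    n = len(weights)
    return sum(w * math.log(n * w) for w in weights if w > 0)


def choose_beta(log_w, eps, iters=50):
    """Largest beta in [0, 1] whose escort distribution stays within the KL budget eps."""
    if kl_to_proposal(tempered_weights(log_w, 1.0)) <= eps:
        return 1.0  # the full target is already inside the trust region
    lo, hi = 0.0, 1.0  # the estimated KL is nondecreasing in beta, so bisection applies
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if kl_to_proposal(tempered_weights(log_w, mid)) <= eps:
            lo = mid
        else:
            hi = mid
    return lo


def trust_region_step(mu, var, n, eps, rng):
    """Sample from q, temper the weights within the KL budget, then refit q by weighted MLE."""
    xs = [rng.gauss(mu, math.sqrt(var)) for _ in range(n)]
    log_w = [log_target(x) - log_normal_pdf(x, mu, var) for x in xs]
    beta = choose_beta(log_w, eps)
    w = tempered_weights(log_w, beta)
    new_mu = sum(wi * x for wi, x in zip(w, xs))                     # weighted MLE of the mean
    new_var = sum(wi * (x - new_mu) ** 2 for wi, x in zip(w, xs))   # weighted MLE of the variance
    return new_mu, new_var, beta


rng = random.Random(0)
mu, var = 0.0, 1.0  # start from the base model N(0, 1)
for it in range(8):
    mu, var, beta = trust_region_step(mu, var, n=5000, eps=0.5, rng=rng)
    print(f"iter {it}: beta={beta:.3f} mu={mu:.3f} var={var:.3f}")
```

For this toy target the tilted distribution is N(1.5, 0.5). The first step is expected to be budget-limited (β < 1) because the base model lies far from it. Once q is within `eps` of the target, β = 1 becomes admissible and the fit settles near the target. This mirrors, in a much simpler setting, the escort-path behavior the abstract describes.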