Meta Flow Maps enable scalable reward alignment

📅 2026-01-20
📈 Citations: 2
Influential: 0
📄 PDF
🤖 AI Summary
Existing methods for controlling generative models rely heavily on extensive trajectory rollouts to estimate value functions, incurring substantial computational cost. This work proposes Meta Flow Maps (MFMs), which extend consistency models and flow maps to the stochastic regime and enable, for the first time, single-step differentiable sampling of arbitrarily many independent and identically distributed clean samples from any intermediate state. This capability allows efficient value function estimation without inner-loop rollouts. The framework unifies inference-time guidance and off-policy reward fine-tuning within a single architecture. Experiments show that a single-particle MFM-guided sampler surpasses a Best-of-1000 baseline across multiple reward metrics on ImageNet while substantially reducing computational overhead.
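The mechanism described above can be made concrete with a short sketch. The code below is a minimal, hypothetical illustration rather than the paper's implementation: `mfm` stands in for a trained Meta Flow Map that maps an intermediate state plus fresh noise to a clean sample in one network call, and `reward_fn` is any differentiable reward.

```python
import torch

def estimate_value(mfm, reward_fn, x_t, t, num_particles=8):
    """Differentiable Monte Carlo estimate of V(x_t) = E[r(x_1) | x_t].

    Assumptions (hypothetical interface, not the paper's code): `mfm(x, t, eps)`
    is a trained one-step posterior sampler returning clean samples
    x_1 ~ p_{1|t}(. | x_t) in a single forward pass; `x_t` is one intermediate
    state of shape (C, H, W); `reward_fn` is a differentiable reward.
    """
    # Replicate x_t so each particle pairs with an independent noise draw.
    batch = x_t.unsqueeze(0).expand(num_particles, *x_t.shape)
    eps = torch.randn_like(batch)
    # One batched call yields num_particles i.i.d. clean samples; gradients
    # flow back to x_t because each sample is a reparametrized function of it.
    x_1 = mfm(batch, t, eps)
    return reward_fn(x_1).mean()
```

Under these assumptions, inference-time steering amounts to nudging each sampler step along `torch.autograd.grad(estimate_value(mfm, reward_fn, x_t, t), x_t)`; the single-particle result quoted above corresponds to `num_particles=1`.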

📝 Abstract
Controlling generative models is computationally expensive. This is because optimal alignment with a reward function, whether via inference-time steering or fine-tuning, requires estimating the value function. This task demands access to the conditional posterior $p_{1|t}(x_1|x_t)$, the distribution of clean data $x_1$ consistent with an intermediate state $x_t$, a requirement that typically compels methods to resort to costly trajectory simulations. To address this bottleneck, we introduce Meta Flow Maps (MFMs), a framework extending consistency models and flow maps into the stochastic regime. MFMs are trained to perform stochastic one-step posterior sampling, generating arbitrarily many i.i.d. draws of clean data $x_1$ from any intermediate state. Crucially, these samples provide a differentiable reparametrization that unlocks efficient value function estimation. We leverage this capability to solve bottlenecks in both paradigms: enabling inference-time steering without inner rollouts, and facilitating unbiased, off-policy fine-tuning to general rewards. Empirically, our single-particle steered-MFM sampler outperforms a Best-of-1000 baseline on ImageNet across multiple rewards at a fraction of the compute.
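In the abstract's notation, both steering and fine-tuning hinge on the value function of a terminal reward $r$. A minimal sketch of the estimator that one-step posterior sampling makes cheap, writing $f_\theta$ for the MFM sampler (our symbol, assumed rather than taken from the paper):

$$V_t(x_t) = \mathbb{E}_{x_1 \sim p_{1|t}(\cdot \mid x_t)}\big[r(x_1)\big], \qquad \hat{V}_t(x_t) = \frac{1}{K}\sum_{k=1}^{K} r\big(f_\theta(x_t, t, \varepsilon^{(k)})\big), \quad \varepsilon^{(k)} \overset{\text{i.i.d.}}{\sim} \mathcal{N}(0, I).$$

Because each draw $f_\theta(x_t, t, \varepsilon^{(k)})$ is a differentiable reparametrization of Gaussian noise, $\nabla_{x_t} \hat{V}_t$ is available by automatic differentiation, with no trajectory simulation from $x_t$ to $x_1$.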
Problem

Research questions and friction points this paper is trying to address.

reward alignment · generative models · value function estimation · conditional posterior · computational cost
Innovation

Methods, ideas, or system contributions that make the work stand out.

Meta Flow Maps · stochastic posterior sampling · reward alignment · value function estimation · generative model steering
Authors

Peter Potaptchik
DPhil Student, University of Oxford
Diffusion Models · Generative Modelling · Machine Learning · Sampling

Adhi Saravanan
University of Oxford

Abbas Mammadov
University of Oxford
Deep Learning · Generative Models · Inverse Problems

Alvaro Prat
University of Oxford

M. S. Albergo
Harvard University, Kempner Institute

Y. W. Teh
University of Oxford