🤖 AI Summary
Flow-based diffusion models—specifically flow matching (FM)—exhibit poorly understood training dynamics, hindering principled design and optimization.
Method: We identify and rigorously characterize a two-phase “navigation–refinement” training dynamic: an early phase dominated by mixing across data modes, which drives global structural generalization, followed by a late phase dominated by the nearest data samples, which drives memorization of local detail. Leveraging the closed-form solution of the marginal velocity field, we derive the exact oracle FM objective and combine theoretical analysis with empirical validation.
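For concreteness, the closed form referred to above can be written out. The following is a sketch under the common linear (rectified-flow) interpolation convention, which this summary does not specify; other interpolation schedules change the weights but not the structure of the result:

```latex
% Assumption: linear paths x_t = (1-t)\,x_0 + t\,x_1 with x_0 \sim \mathcal{N}(0, I)
% and conditional velocity x_1 - x_0. Marginalizing over a finite dataset \{x_1^{(i)}\}:
u_t(x) \;=\; \mathbb{E}\!\left[\, x_1 - x_0 \;\middle|\; x_t = x \,\right]
       \;=\; \sum_i w_i(x, t)\, \frac{x_1^{(i)} - x}{1 - t},
\qquad
w_i(x, t) \;=\; \operatorname{softmax}_i\!\left( -\frac{\lVert x - t\, x_1^{(i)} \rVert^2}{2\,(1 - t)^2} \right).
```

As $t \to 0$ the Gaussians overlap and the weights are nearly uniform, so the target is a mixture over all data modes (navigation); as $t \to 1$ the effective temperature $(1-t)^2$ vanishes and the softmax collapses onto the nearest sample (refinement).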
Contribution/Results: This work is the first to formally establish and verify this two-stage mechanism. It explains the intrinsic efficacy of key techniques, including timestep-shifted schedules and classifier-free guidance, by linking them to phase-specific behavioral shifts. Our analysis clarifies the model's evolution from generalization to memorization, offering a conceptual framework for understanding training dynamics together with interpretable, actionable principles for architecture design and algorithmic improvement in flow-based generative modeling.
📝 Abstract
Flow-based diffusion models have emerged as a leading paradigm for training generative models across images and videos. However, their memorization-generalization behavior remains poorly understood. In this work, we revisit the flow matching (FM) objective and study its marginal velocity field, which admits a closed-form expression, allowing exact computation of the oracle FM target. Analyzing this oracle velocity field reveals that flow-based diffusion models inherently formulate a two-stage training target: an early stage guided by a mixture of data modes, and a later stage dominated by the nearest data sample. The two-stage objective leads to distinct learning behaviors: the early navigation stage generalizes across data modes to form global layouts, whereas the later refinement stage increasingly memorizes fine-grained details. Leveraging these insights, we explain the effectiveness of practical techniques such as timestep-shifted schedules, classifier-free guidance intervals, and latent space design choices. Our study deepens the understanding of diffusion model training dynamics and offers principles for guiding future architectural and algorithmic improvements.
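The two-stage behavior described above can be reproduced numerically. Below is a minimal NumPy sketch, again assuming linear (rectified-flow) paths `x_t = (1-t)*x0 + t*x1` with Gaussian `x0` (an assumption; the paper's exact path and latent space are not given here). The oracle velocity is a softmax-weighted average of per-sample directions: the weights spread across data modes at small `t` but collapse onto the nearest sample as `t` approaches 1.

```python
import numpy as np

def oracle_velocity(x, t, data):
    """Oracle (marginal) FM velocity for linear paths x_t = (1-t)*x0 + t*x1,
    x0 ~ N(0, I). The marginal field is a posterior-weighted mixture of the
    per-sample directions (x1_i - x) / (1 - t)."""
    # Posterior weights: p(x1_i | x_t = x) ∝ N(x; t*x1_i, (1-t)^2 I)
    d2 = np.sum((x - t * data) ** 2, axis=1)   # squared distances to t*x1_i
    logw = -d2 / (2.0 * (1.0 - t) ** 2)
    w = np.exp(logw - logw.max())              # stable softmax
    w /= w.sum()
    v = (data - x) / (1.0 - t)                 # per-sample conditional velocities
    return w @ v, w

rng = np.random.default_rng(0)
data = rng.normal(size=(8, 2))                 # toy dataset: 8 "modes" in 2D
x = rng.normal(size=2)                         # a query point

_, w_early = oracle_velocity(x, 0.05, data)    # navigation: broad mixture of modes
_, w_late = oracle_velocity(x, 0.95, data)     # refinement: weight on nearest sample
```

At `t = 0.05` the posterior weights are close to uniform, so the target velocity points toward a blend of all modes (global layout); at `t = 0.95` a single weight dominates, so the target pulls toward one training sample, which is the memorization behavior the abstract describes.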