🤖 AI Summary
The interplay between the instantaneous and mean velocity fields in MeanFlow remains poorly understood, limiting both few-step generation quality and training efficiency. Method: We establish that a well-learned instantaneous velocity field is a prerequisite for estimating the mean velocity field, and propose a time-interval-based curriculum: prioritize short-interval instantaneous velocities early to accelerate convergence, then progressively shift emphasis to long-interval mean velocities for refinement, dynamically decoupling and coordinating the two objectives. The approach combines the DiT architecture, coupled velocity-field modeling, task-affinity analysis, and phased optimization. Results: On ImageNet 256×256, the method achieves a FID of 2.87 with 1-NFE sampling, a significant improvement over the baseline (3.43). It also matches baseline performance with 2.5× less training, or with a smaller DiT-L backbone. This work provides the first systematic analysis and improvement of MeanFlow's training dynamics.
📝 Abstract
MeanFlow promises high-quality generative modeling in a few steps by jointly learning instantaneous and average velocity fields. Yet the underlying training dynamics remain unclear. We analyze the interaction between the two velocities and find: (i) a well-established instantaneous velocity is a prerequisite for learning the average velocity; (ii) learning of the instantaneous velocity benefits from the average velocity when the temporal gap is small, but degrades as the gap increases; and (iii) task-affinity analysis indicates that smooth learning of large-gap average velocities, essential for one-step generation, depends on the prior formation of accurate instantaneous and small-gap average velocities. Guided by these observations, we design an effective training scheme that accelerates the formation of the instantaneous velocity, then shifts emphasis from short- to long-interval average velocity. Our enhanced MeanFlow training yields faster convergence and significantly better few-step generation: with the same DiT-XL backbone, our method reaches an impressive FID of 2.87 on 1-NFE ImageNet 256×256, compared to 3.43 for the conventional MeanFlow baseline. Alternatively, our method matches the performance of the MeanFlow baseline with 2.5× shorter training time, or with a smaller DiT-L backbone.
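The short-to-long-interval curriculum described above can be sketched as a schedule on the sampled time pair (r, t): early in training the allowed gap t − r is kept near zero, so the objective reduces to fitting the instantaneous velocity, and the cap on the gap is then ramped up so large-interval average velocities dominate later. The schedule shape, phase boundary, and all parameter names below are illustrative assumptions, not the paper's exact scheme.

```python
import random

def interval_cap(progress, warmup=0.25, cap_final=1.0):
    """Maximum allowed gap t - r at a given training progress in [0, 1].

    Before `warmup`, the cap stays tiny so training focuses on the
    instantaneous velocity (r ~= t); afterwards it ramps linearly to
    `cap_final`, shifting emphasis to long-interval average velocities
    needed for one-step (1-NFE) generation. Illustrative schedule only.
    """
    floor = 0.05 * cap_final  # small nonzero gap in the early phase
    if progress < warmup:
        return floor
    ramp = min(1.0, (progress - warmup) / (1.0 - warmup))
    return floor + (cap_final - floor) * ramp

def sample_time_pair(progress, rng=random):
    """Sample (r, t) with 0 <= r <= t <= 1 under the curriculum cap."""
    t = rng.uniform(0.0, 1.0)
    gap = rng.uniform(0.0, interval_cap(progress))
    r = max(0.0, t - gap)
    return r, t
```

In a MeanFlow-style training loop, `sample_time_pair(step / total_steps)` would replace the usual uniform sampling of (r, t); everything else (the average-velocity identity and its loss) stays unchanged.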