🤖 AI Summary
High-dimensional visual generation faces challenges including prohibitive computational cost and architectural complexity in multi-stage frameworks, which require customized diffusion formulations, cascaded models, or specialized samplers. This paper proposes a streamlined, efficient decomposition-based flow matching framework: for the first time, flow matching is decoupled and applied independently to each level of a multi-scale representation (e.g., a Laplacian pyramid), enabling coarse-to-fine progressive generation. The method employs only a single unified model, eliminating stage-wise transition design and model cascading, while remaining fully compatible with standard training pipelines. Evaluated on ImageNet-1K at 512×512, it achieves a 35.2% improvement in FDD scores, faster convergence, and significantly better generation quality than existing progressive approaches. The core innovation lies in multi-scale decoupled flow matching, which jointly optimizes efficiency, generality, and performance.
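To make the multi-scale representation concrete, the sketch below builds a Laplacian pyramid with NumPy. It is an illustrative toy, not the paper's implementation: the pooling and upsampling operators, the function names, and the number of levels are all assumptions, chosen so that the decomposition is exactly invertible.

```python
import numpy as np

def downsample(x):
    # 2x average pooling over the spatial dims (H, W)
    h, w = x.shape[-2] // 2, x.shape[-1] // 2
    return x.reshape(*x.shape[:-2], h, 2, w, 2).mean(axis=(-3, -1))

def upsample(x):
    # nearest-neighbor 2x upsampling
    return x.repeat(2, axis=-2).repeat(2, axis=-1)

def laplacian_pyramid(x, levels):
    """Decompose x into `levels` band-pass residuals plus a coarse base."""
    pyramid = []
    for _ in range(levels):
        coarse = downsample(x)
        pyramid.append(x - upsample(coarse))  # detail the coarse level misses
        x = coarse
    pyramid.append(x)  # low-resolution base image
    return pyramid

def reconstruct(pyramid):
    """Invert the decomposition: upsample the base, add residuals back in."""
    x = pyramid[-1]
    for detail in reversed(pyramid[:-1]):
        x = upsample(x) + detail
    return x
```

With these operators the reconstruction is exact, which is what allows a generative model to operate on the levels independently and still synthesize a full-resolution output coarse-to-fine.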
📝 Abstract
Generating high-dimensional visual modalities is a computationally intensive task. A common solution is progressive generation, where the outputs are synthesized in a coarse-to-fine spectral autoregressive manner. While diffusion models benefit from the coarse-to-fine nature of denoising, explicit multi-stage architectures are rarely adopted. Such architectures increase the complexity of the overall approach, introducing the need for a custom diffusion formulation, decomposition-dependent stage transitions, ad-hoc samplers, or a model cascade. Our contribution, Decomposable Flow Matching (DFM), is a simple and effective framework for the progressive generation of visual media. DFM applies Flow Matching independently at each level of a user-defined multi-scale representation (such as a Laplacian pyramid). As shown by our experiments, our approach improves visual quality for both images and videos, with superior results compared to prior multi-stage frameworks. On ImageNet-1k 512px, DFM achieves a 35.2% improvement in FDD scores over the base architecture and 26.4% over the best-performing baseline, under the same training compute. When applied to the finetuning of large models, such as FLUX, DFM shows faster convergence to the training distribution. Crucially, all these advantages are achieved with a single model, architectural simplicity, and minimal modifications to existing training pipelines.
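The key training idea in the abstract, applying Flow Matching independently at each pyramid level, can be sketched as a per-level linear-interpolant loss. This is a hedged toy, not DFM's actual objective: the model interface `model(x_t, t, level_idx)`, the independent per-level timesteps, and the uniform averaging across levels are all assumptions for illustration.

```python
import numpy as np

def per_level_flow_matching_loss(model, levels, rng):
    """Linear-interpolant flow matching applied to each level separately.

    `levels` is a list of clean pyramid-level tensors; `model(x_t, t, k)`
    is a hypothetical velocity predictor for level k. Each level draws its
    own timestep and noise, so the levels are fully decoupled.
    """
    total = 0.0
    for k, x1 in enumerate(levels):            # x1: clean level (data)
        t = rng.uniform()                      # level-specific timestep
        x0 = rng.standard_normal(x1.shape)     # Gaussian noise sample
        x_t = (1.0 - t) * x0 + t * x1          # linear interpolant
        v_target = x1 - x0                     # constant velocity target
        v_pred = model(x_t, t, k)
        total += np.mean((v_pred - v_target) ** 2)
    return total / len(levels)
```

Because each level carries its own noising schedule, a single network can be trained on all levels at once without stage transitions or cascaded models, matching the "single model" claim in the abstract.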