🤖 AI Summary
Existing flow matching methods for multi-scale image generation rely on cascaded architectures and explicit re-noising, which constrain efficiency and scalability. This work proposes a parallel multi-scale flow matching framework that decomposes images into multi-scale residuals using a Laplacian pyramid and models all scales simultaneously via a mixture-of-transformers architecture with causal attention. The approach eliminates the need for cascading or re-noising, achieving superior sample quality on CelebA-HQ and ImageNet with faster inference and reduced computational overhead. Notably, it scales effectively to 1024×1024 high-resolution image generation.
📝 Abstract
In this paper, we present Laplacian multiscale flow matching (LapFlow), a novel framework that enhances flow matching by leveraging multi-scale representations for image generative modeling. Our approach decomposes images into Laplacian pyramid residuals and processes the different scales in parallel through a mixture-of-transformers (MoT) architecture with causal attention mechanisms. Unlike previous cascaded approaches that require explicit re-noising between scales, our model generates multi-scale representations in parallel, eliminating the need for bridging processes. The proposed multi-scale architecture not only improves generation quality but also accelerates sampling and improves the scalability of flow matching methods. Through extensive experiments on CelebA-HQ and ImageNet, we demonstrate that our method achieves superior sample quality with fewer GFLOPs and faster inference than single-scale and multi-scale flow matching baselines. The proposed model scales effectively to high-resolution generation (up to 1024$\times$1024) while maintaining lower computational overhead.
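The Laplacian pyramid decomposition underlying the method can be illustrated with a minimal sketch. Note this is not the paper's implementation: it uses simple average-pool downsampling and nearest-neighbour upsampling in place of the usual Gaussian blur and decimation, and the function names (`laplacian_pyramid`, `reconstruct`) are hypothetical. The key property it demonstrates is that the residuals plus the coarse base are an exactly invertible multi-scale representation of the image.

```python
import numpy as np

def downsample(x):
    # 2x average-pool downsampling (simplified stand-in for blur + decimate)
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

def upsample(x):
    # nearest-neighbour 2x upsampling
    return np.repeat(np.repeat(x, 2, axis=0), 2, axis=1)

def laplacian_pyramid(img, levels):
    """Decompose an image into Laplacian residuals plus a coarse base."""
    residuals = []
    cur = img
    for _ in range(levels):
        down = downsample(cur)
        # the residual is the detail lost by downsampling at this scale
        residuals.append(cur - upsample(down))
        cur = down
    return residuals, cur  # residuals (fine -> coarse) and low-res base

def reconstruct(residuals, base):
    """Invert the pyramid: upsample the base, add residuals coarse -> fine."""
    cur = base
    for r in reversed(residuals):
        cur = upsample(cur) + r
    return cur
```

Because each residual stores exactly what its downsampling step discarded, `reconstruct` recovers the original image; a model that generates the base and all residuals (here, in parallel across scales) therefore fully specifies the image.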