🤖 AI Summary
Flow-based models suffer from poor structural coherence and low computational efficiency when modeling long-range dependencies and high-dimensional image distributions. To address this, we propose an autoregressive flow generation paradigm: (i) a causal noise ordering scheme constructs denoising trajectory sequences, enabling autoregressive modeling of category-level long-range dependencies among preceding denoised images; (ii) a hybrid linear attention mechanism, designed specifically for denoising trajectories, improves modeling capacity and computational efficiency while preserving invertibility; and (iii) classifier-free guidance (CFG) sampling is integrated. On ImageNet 128×128, our method achieves an FID of 4.34 (CFG = 1.5), substantially outperforming SiT (9.17 FID), and converges within 400K training steps. This work pioneers the deep integration of autoregressive modeling with invertible flow networks, establishing a new paradigm for high-fidelity image generation.
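The causal noise ordering in (i) can be sketched as follows: sample several images from the same category, noise them at strictly decreasing levels, and order them so noisier samples precede cleaner ones. This is a minimal illustration, not the paper's implementation; the function name `build_causal_sequence` and the rectified-flow-style interpolation `x_t = (1 - t) * x0 + t * eps` are assumptions and may differ from ARFlow's exact noise schedule.

```python
import numpy as np

def build_causal_sequence(images, rng):
    """Noise K same-category images at strictly decreasing levels so that
    noisier images act as causal predecessors of cleaner ones.
    Hypothetical sketch; the interpolation below is a generic
    rectified-flow style x_t = (1 - t) * x0 + t * eps."""
    K = len(images)
    # Strictly decreasing noise levels in (0, 1): earlier in the sequence = noisier.
    t = np.sort(rng.uniform(0.0, 1.0, size=K))[::-1]
    seq = []
    for x0, ti in zip(images, t):
        eps = rng.standard_normal(x0.shape)  # fresh Gaussian noise per image
        seq.append((1.0 - ti) * x0 + ti * eps)
    return np.stack(seq), t

rng = np.random.default_rng(0)
imgs = rng.standard_normal((4, 3, 8, 8))  # 4 same-class images, 3x8x8 each
seq, levels = build_causal_sequence(imgs, rng)
```

During training, such a sequence plays the role of a denoising trajectory: the model predicts each image conditioned on all noisier predecessors.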
📝 Abstract
Flow models are effective at progressively generating realistic images, but they generally struggle to capture long-range dependencies during generation because they compress all information from previous time steps into a single corrupted image. To address this limitation, we propose integrating autoregressive modeling, known for its strength in modeling complex, high-dimensional joint probability distributions, into flow models. During training, at each step we construct causally ordered sequences by sampling multiple images from the same semantic category and applying different levels of noise, where images with higher noise levels serve as causal predecessors to those with lower noise levels. This design enables the model to learn broader category-level variations while maintaining proper causal relationships in the flow process. During generation, the model autoregressively conditions on images generated at earlier denoising steps, forming a contextual and coherent generation trajectory. Additionally, we design a customized hybrid linear attention mechanism tailored to our modeling approach to improve computational efficiency. Our approach, termed ARFlow, reaches an FID of 14.08 on ImageNet 128×128 within 400K training steps without classifier-free guidance, and an FID of 4.34 with classifier-free guidance scale 1.5, significantly outperforming the previous flow-based model SiT's 9.17 FID. Extensive ablation studies demonstrate the effectiveness of our modeling strategy and chunk-wise attention design.
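The chunk-wise attention pattern mentioned above can be illustrated with a simple mask: tokens attend bidirectionally within their own image (chunk) and causally to all tokens of noisier, earlier images in the trajectory. This is a hypothetical sketch of that masking rule only; the function name `chunkwise_causal_mask` is invented here, and ARFlow's hybrid linear attention additionally replaces parts of this dense attention with linear-complexity components not shown.

```python
import numpy as np

def chunkwise_causal_mask(num_images, tokens_per_image):
    """Boolean mask for a trajectory of `num_images` token chunks.
    allowed[i, j] is True when token i may attend to token j:
    full attention inside a chunk, causal attention across chunks."""
    n = num_images * tokens_per_image
    chunk = np.arange(n) // tokens_per_image  # chunk index of each token
    # Token i may attend to token j iff j's chunk is not later than i's.
    allowed = chunk[:, None] >= chunk[None, :]
    return allowed

mask = chunkwise_causal_mask(num_images=3, tokens_per_image=2)
```

Within a chunk the mask is symmetric (bidirectional), while across chunks it is lower-triangular at the chunk level, which is what lets each image in the sequence see its noisier predecessors but not its cleaner successors.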