🤖 AI Summary
This work addresses a core difficulty of flow-based generation in high-dimensional spaces: velocity prediction requires modeling full-dimensional noise even when the data have strong low-rank structure. The authors propose AsymFlow, a rank-asymmetric velocity parameterization that restricts noise prediction to a low-rank subspace while keeping data prediction full-dimensional, and then analytically recovers the full-dimensional velocity without changing the network architecture or the training and sampling procedures. The same construction provides the first route for finetuning pretrained latent flow models into pixel-space models: aligning the low-rank pixel subspace with the latent space yields an initialization that preserves the latent model's high-level semantics and structure, so finetuning mainly corrects low-level mismatches rather than relearning pixel generation. On ImageNet 256×256, AsymFlow achieves a leading FID of 1.57, outperforming prior DiT/JiT-like pixel diffusion models by a large margin. The pixel-space AsymFlow model finetuned from FLUX.2 klein 9B sets a new state of the art for pixel-space text-to-image generation, beating its latent base model on HPSv3, DPG-Bench, and GenEval and showing qualitatively improved visual realism.
📝 Abstract
Flow-based generation in high-dimensional spaces is difficult because velocity prediction requires modeling high-dimensional noise, even when data has strong low-rank structure. We present Asymmetric Flow Modeling (AsymFlow), a rank-asymmetric velocity parameterization that restricts noise prediction to a low-rank subspace while keeping data prediction full-dimensional. From this asymmetric prediction, AsymFlow analytically recovers the full-dimensional velocity without changing the network architecture or training/sampling procedures. On ImageNet 256$\times$256, AsymFlow achieves a leading 1.57 FID, outperforming prior DiT/JiT-like pixel diffusion models by a large margin. AsymFlow also provides the first-ever route for finetuning pretrained latent flow models into pixel-space models: aligning the low-rank pixel subspace to the latent space gives a seamless initialization that preserves the latent model's high-level semantics and structure, so finetuning mainly improves low-level mismatches rather than relearning pixel generation. We show that the pixel AsymFlow model finetuned from FLUX.2 klein 9B establishes a new state of the art for pixel-space text-to-image generation, beating its latent base on HPSv3, DPG-Bench, and GenEval while qualitatively showing substantially improved visual realism.
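To make the idea of a rank-asymmetric prediction concrete, here is a minimal NumPy sketch. It is an illustration under stated assumptions, not the authors' implementation. It assumes a linear interpolation path $x_t = (1 - t)\,x_0 + t\,\epsilon$ with velocity $v = \epsilon - x_0$, and a fixed orthonormal basis for the low-rank subspace; the abstract specifies neither. The function names `orthonormal_basis` and `recover_velocity` are hypothetical. Under these assumptions, one plausible way to recover a full-dimensional velocity is to take the noise component inside the subspace from the low-rank prediction and to take the component outside it from the noise implied by the full-dimensional data prediction.

```python
import numpy as np


def orthonormal_basis(dim, rank, seed=0):
    """Return a (dim, rank) matrix with orthonormal columns.

    Stand-in for the low-rank subspace; how the real subspace is chosen
    is not described in the abstract (assumption).
    """
    rng = np.random.default_rng(seed)
    q, _ = np.linalg.qr(rng.standard_normal((dim, rank)))
    return q


def recover_velocity(x_t, t, x0_hat, eps_low_hat, basis):
    """Combine a full-dimensional data prediction with a low-rank noise prediction.

    Assumes x_t = (1 - t) * x0 + t * eps and v = eps - x0 (assumption),
    with 0 < t <= 1.
    """
    # Noise implied by the data prediction alone, in full dimension.
    eps_implied = (x_t - (1.0 - t) * x0_hat) / t
    # Inside the subspace: use the predicted low-rank noise coefficients.
    eps_in = basis @ eps_low_hat
    # Outside the subspace: keep the orthogonal complement of the implied noise.
    eps_out = eps_implied - basis @ (basis.T @ eps_implied)
    return (eps_in + eps_out) - x0_hat


# Sanity check with oracle predictions: the exact velocity should be recovered.
dim, rank, t = 64, 8, 0.3
rng = np.random.default_rng(1)
x0, eps = rng.standard_normal(dim), rng.standard_normal(dim)
basis = orthonormal_basis(dim, rank)
x_t = (1.0 - t) * x0 + t * eps
v = recover_velocity(x_t, t, x0, basis.T @ eps, basis)
assert np.allclose(v, eps - x0)
```

In this sketch the network would only need to output `dim + rank` numbers per sample rather than a full-dimensional noise or velocity estimate on top of the data prediction, which is the asymmetry the abstract describes. The division by `t` means the formula as written is only usable away from `t = 0`.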