🤖 AI Summary
Current one-step text-to-image generative models struggle to achieve high fidelity, fast inference, and training efficiency at the same time. This work proposes the APEX framework, which introduces an endogenous adversarial signal mechanism for the first time: by leveraging condition shifting within flow matching, it extracts adversarial correction signals directly from the generator itself. This eliminates the need for an external discriminator and avoids the associated training instability and high GPU memory overhead. The method combines self-adversarial gradient estimation with efficient LoRA fine-tuning while leaving the model architecture unchanged. Experiments show that a 0.6B-parameter APEX model surpasses the 12B-parameter FLUX-Schnell (20× more parameters) in one-step quality. Moreover, when applied to Qwen-Image 20B, just six hours of LoRA fine-tuning yields a GenEval score of 0.89 at NFE=1, outperforming the original 50-step teacher model (0.87) while accelerating inference by 15.33×.
📝 Abstract
The push for efficient text-to-image synthesis has moved the field toward one-step sampling, yet existing methods still face a three-way tradeoff among fidelity, inference speed, and training efficiency. Approaches that rely on external discriminators can sharpen one-step performance, but they often introduce training instability, high GPU memory overhead, and slow convergence, which complicates scaling and parameter-efficient tuning. In contrast, regression-based distillation and consistency objectives are easier to optimize, but they typically lose fine details when constrained to a single step. We present APEX, built on a key theoretical insight: adversarial correction signals can be extracted endogenously from a flow model through condition shifting. Applying a transformation to the condition creates a shifted-condition branch whose velocity field serves as an independent estimator of the model's current generation distribution. This yields a gradient that is provably GAN-aligned and replaces the sample-dependent discriminator terms that cause gradient vanishing. This discriminator-free design is architecture-preserving, making APEX a plug-and-play framework compatible with both full-parameter and LoRA-based tuning. Empirically, our 0.6B model surpasses FLUX-Schnell 12B (20$\times$ more parameters) in one-step quality. With LoRA tuning on Qwen-Image 20B, APEX reaches a GenEval score of 0.89 at NFE=1 in 6 hours, surpassing the original 50-step teacher (0.87) and providing a 15.33$\times$ inference speedup. Code is available at https://github.com/LINs-lab/APEX.
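The abstract does not specify the transformation or the training objective, so the following is only a toy sketch of the general idea of condition shifting: one velocity network is queried twice, once under the original condition and once under a shifted condition, and the difference between the two branches is treated as a correction signal. The tiny network, the affine shift, the sign of the difference, and every name here (`velocity`, `shift_condition`, `correction_signal`) are hypothetical illustrations and not the APEX implementation; the linked repository is the authoritative reference.

```python
import numpy as np

rng = np.random.default_rng(0)


def init_params(dim_x, dim_c, hidden):
    """Random weights for a toy one-hidden-layer velocity network (hypothetical)."""
    return (
        rng.normal(size=(dim_x, hidden)) * 0.1,  # W_x: sample -> hidden
        rng.normal(size=(dim_c, hidden)) * 0.1,  # W_c: condition -> hidden
        rng.normal(size=(hidden,)) * 0.1,        # w_t: time -> hidden
        rng.normal(size=(hidden, dim_x)) * 0.1,  # W_out: hidden -> velocity
    )


def velocity(params, x_t, t, cond):
    """Toy conditional velocity field v(x_t, t, c)."""
    W_x, W_c, w_t, W_out = params
    hidden = np.tanh(x_t @ W_x + cond @ W_c + t[:, None] * w_t)
    return hidden @ W_out


def shift_condition(cond, scale=0.5, offset=1.0):
    """Assumed affine shift of the condition; the source does not give its form."""
    return scale * cond + offset


def correction_signal(params, x_t, t, cond, scale=0.5, offset=1.0):
    """Difference between the original-condition and shifted-condition branches.

    Both branches share the same weights, so no external discriminator
    network is involved. How this difference enters a loss is not shown here.
    """
    v_original = velocity(params, x_t, t, cond)
    v_shifted = velocity(params, x_t, t, shift_condition(cond, scale, offset))
    return v_original - v_shifted


# Usage: a batch of 2 samples with 4-dim data and 3-dim conditions.
params = init_params(dim_x=4, dim_c=3, hidden=8)
x_t = rng.normal(size=(2, 4))
t = rng.uniform(size=(2,))
cond = rng.normal(size=(2, 3))
signal = correction_signal(params, x_t, t, cond)
```

With an identity shift (`scale=1.0`, `offset=0.0`) the two branches coincide and the signal is zero, which illustrates why the shifted branch has to differ from the original condition to carry any information.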