🤖 AI Summary
Current text-to-image generation models are constrained either by local supervision, which requires multi-step iterative refinement, or by knowledge distillation, which depends on pre-trained teacher models; neither supports arbitrary-step inference and end-to-end training from scratch at the same time. This paper introduces Self-E, the first teacher-free, step-agnostic, end-to-end flow matching framework. Self-E unifies local flow matching with global consistency optimization via a dynamic self-assessment mechanism, and further incorporates self-supervised score estimation and differentiable trajectory modeling, enabling continuous, fine-grained control of the number of inference steps (1 to 50) within a single model. Experiments show that Self-E surpasses state-of-the-art methods in image quality at 1 to 4 steps, matches SOTA flow matching performance at 50 steps, and improves monotonically as the number of inference steps increases.
📝 Abstract
We introduce the Self-Evaluating Model (Self-E), a novel, from-scratch training approach for text-to-image generation that supports any-step inference. Self-E learns from data similarly to a Flow Matching model, while simultaneously employing a self-evaluation mechanism: it evaluates its own generated samples using its current score estimates, effectively serving as a dynamic self-teacher. Unlike traditional diffusion or flow models, it does not rely solely on local supervision, which typically necessitates many inference steps. Unlike distillation-based approaches, it does not require a pretrained teacher. This combination of instantaneous local learning and self-driven global matching bridges the gap between the two paradigms, enabling the training of a high-quality text-to-image model from scratch that excels even at very low step counts. Extensive experiments on large-scale text-to-image benchmarks show that Self-E not only excels in few-step generation, but is also competitive with state-of-the-art Flow Matching models at 50 steps. We further find that its performance improves monotonically as inference steps increase, enabling both ultra-fast few-step generation and high-quality long-trajectory sampling within a single unified model. To our knowledge, Self-E is the first from-scratch, any-step text-to-image model, offering a unified framework for efficient and scalable generation.
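To make the two training signals concrete, here is a minimal toy sketch of how a local flow matching loss could be combined with a self-evaluation term in the spirit the abstract describes. Everything here is an illustrative assumption rather than the paper's actual method: the linear "model" `v`, the Euler sampler, the one-step shortcut-consistency target, and the 0.1 loss weighting are all placeholders standing in for the real network and objective.

```python
import numpy as np

rng = np.random.default_rng(0)

def v(x, t, W):
    # Toy stand-in for the learned velocity/score network (assumption:
    # a fixed linear map that ignores t; the real model is a text-conditioned
    # image network).
    return x @ W

def flow_matching_loss(x0, x1, t, W):
    # Standard flow matching objective on the linear interpolation path
    # x_t = (1 - t) * x0 + t * x1, with target velocity x1 - x0.
    xt = (1.0 - t)[:, None] * x0 + t[:, None] * x1
    target = x1 - x0
    pred = v(xt, t, W)
    return np.mean((pred - target) ** 2)

def self_evaluation_loss(x0, W, n_steps=4):
    # Hypothetical sketch of the "dynamic self-teacher" idea: roll out the
    # model's own trajectory for a few Euler steps, then ask a one-step
    # prediction from noise to agree with that multi-step endpoint.
    x = x0.copy()
    ts = np.linspace(0.0, 1.0, n_steps + 1)
    for t0, t1 in zip(ts[:-1], ts[1:]):
        x = x + (t1 - t0) * v(x, np.full(len(x), t0), W)
    one_step = x0 + v(x0, np.zeros(len(x0)), W)
    return np.mean((one_step - x) ** 2)

d = 8
x0 = rng.standard_normal((16, d))   # noise samples
x1 = rng.standard_normal((16, d))   # data samples
t = rng.uniform(size=16)            # random interpolation times
W = 0.01 * rng.standard_normal((d, d))

# Combined objective: local supervision plus self-driven global matching
# (the 0.1 weight is arbitrary for this sketch).
total = flow_matching_loss(x0, x1, t, W) + 0.1 * self_evaluation_loss(x0, W)
print(float(total))
```

Because the second term supervises the model's global trajectory using only its own current estimates, no pretrained teacher is needed, which is the property the abstract emphasizes.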