🤖 AI Summary
To address the degraded generation quality of maximum likelihood estimation (MLE)-based visual generative models under finite capacity—caused by the mode-covering bias of MLE—this paper proposes the Direct Discriminative Optimization (DDO) framework. DDO theoretically reveals that likelihood-based models implicitly embody a GAN-style discriminator; it then parameterizes this discriminator explicitly as the likelihood ratio between a learnable target model and a fixed reference model, enabling efficient fine-tuning of pre-trained diffusion and autoregressive models without joint adversarial training. The method draws on the philosophy of Direct Preference Optimization (DPO) and can be applied iteratively in a self-play manner, with each round costing less than 1% of pretraining epochs. On CIFAR-10 and ImageNet-64, DDO reduces the FID of the EDM model from 1.79 to 1.30 and from 1.58 to 0.97, respectively. For visual autoregressive models on ImageNet 256×256, DDO consistently improves both guidance-free and CFG-enhanced FID, demonstrating broad applicability across diffusion and autoregressive paradigms.
📝 Abstract
While likelihood-based generative models, particularly diffusion and autoregressive models, have achieved remarkable fidelity in visual generation, the maximum likelihood estimation (MLE) objective inherently suffers from a mode-covering tendency that limits the generation quality under limited model capacity. In this work, we propose Direct Discriminative Optimization (DDO) as a unified framework that bridges likelihood-based generative training and the GAN objective to bypass this fundamental constraint. Our key insight is to parameterize a discriminator implicitly using the likelihood ratio between a learnable target model and a fixed reference model, drawing parallels with the philosophy of Direct Preference Optimization (DPO). Unlike GANs, this parameterization eliminates the need for joint training of generator and discriminator networks, allowing for direct, efficient, and effective finetuning of a well-trained model to its full potential beyond the limits of MLE. DDO can be performed iteratively in a self-play manner for progressive model refinement, with each round requiring less than 1% of pretraining epochs. Our experiments demonstrate the effectiveness of DDO by significantly advancing the previous SOTA diffusion model EDM, reducing FID scores from 1.79/1.58 to new records of 1.30/0.97 on CIFAR-10/ImageNet-64 datasets, and by consistently improving both guidance-free and CFG-enhanced FIDs of visual autoregressive models on ImageNet 256×256.
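To make the "implicit discriminator" idea concrete, the following is a minimal toy sketch, not the paper's actual objective: it plugs the scaled log-likelihood ratio d(x) = β·(log p_θ(x) − log p_ref(x)) into a standard logistic GAN discriminator loss, mirroring how DPO treats the same ratio as an implicit reward. The function name `ddo_loss` and the temperature `beta` are illustrative assumptions; real samples stand in for data draws and fake samples for draws from the frozen reference model.

```python
import math

def ddo_loss(logp_theta_real, logp_ref_real,
             logp_theta_fake, logp_ref_fake, beta=1.0):
    """Logistic GAN loss with an implicit likelihood-ratio discriminator.

    d(x) = beta * (log p_theta(x) - log p_ref(x)); `beta` is a
    hypothetical temperature, not a parameter taken from the paper.
    """
    sigmoid = lambda z: 1.0 / (1.0 + math.exp(-z))
    d_real = beta * (logp_theta_real - logp_ref_real)  # data sample
    d_fake = beta * (logp_theta_fake - logp_ref_fake)  # reference sample
    # Push d(x) up on data samples and down on reference samples:
    # only p_theta is trainable, so no separate discriminator network.
    return -math.log(sigmoid(d_real)) - math.log(sigmoid(-d_fake))

# At initialization theta equals the reference, so d(x) = 0 everywhere
# and the loss sits at 2*log(2); it drops as p_theta up-weights data.
print(ddo_loss(0.0, 0.0, 0.0, 0.0))  # → 1.3862943611198906 (= 2 ln 2)
```

Because the discriminator is just a reparameterization of the generative model itself, minimizing this loss finetunes p_θ directly, with no adversarial inner loop.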