AI Summary
This work addresses two problems: autoregressive image generation models trained by maximum likelihood are not directly optimized for sample quality or diversity, and existing reinforcement learning approaches are prone to mode collapse. The authors formulate token-based generation as a Markov decision process and fine-tune the policy with Group Relative Policy Optimization (GRPO), combining instance-level rewards (CLIP, HPSv2) with a distribution-level reward. The key contribution is a leave-one-out FID (LOO-FID) reward that uses an exponential moving average of feature moments to explicitly promote diversity. An adaptive entropy regularization term stabilizes the multi-objective optimization. Within a few hundred fine-tuning iterations, the method improves standard quality and diversity metrics on LlamaGen and VQGAN. The tuned model also produces competitive samples without Classifier-Free Guidance, which avoids the roughly 2x inference cost that guidance incurs.
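As a rough illustration of the group-relative step in GRPO, the sketch below normalizes each sample's reward against the other samples drawn for the same condition. It assumes the commonly used mean/std normalization; the function name `group_relative_advantages` and the `eps` parameter are hypothetical, and the paper's exact advantage computation may differ.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within each group (assumed GRPO-style formulation).

    rewards: array of shape (num_groups, group_size), one row per condition
             (for example, one class label or prompt), one column per sample.
    Returns an array of the same shape with zero mean in every row.
    """
    rewards = np.asarray(rewards, dtype=np.float64)
    mean = rewards.mean(axis=-1, keepdims=True)
    std = rewards.std(axis=-1, keepdims=True)
    # eps keeps the division finite when every sample in a group ties.
    return (rewards - mean) / (std + eps)

# Hypothetical usage: two groups of four sampled images each.
example = group_relative_advantages([[0.2, 0.5, 0.1, 0.9],
                                     [1.0, 1.0, 1.0, 1.0]])
```

Because advantages are centered within each group, no learned value function is needed, and a group whose samples all score the same contributes no gradient signal.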
Abstract
Autoregressive (AR) models are highly effective for image generation, yet their standard maximum-likelihood estimation training does not directly optimize sample quality or diversity. While reinforcement learning (RL) has been used to align diffusion models, these methods typically suffer from a collapse in output diversity. Similarly, concurrent RL methods for AR models rely strictly on instance-level rewards, often trading off distributional coverage for quality. To address these limitations, we propose a lightweight RL framework that casts token-based AR synthesis as a Markov Decision Process, optimized via Group Relative Policy Optimization (GRPO). Our core contribution is a novel distribution-level Leave-One-Out FID (LOO-FID) reward; by leveraging an exponential moving average of feature moments, it explicitly encourages sample diversity and prevents mode collapse during policy updates. We integrate this with composite instance-level rewards (CLIP and HPSv2) for strict semantic and perceptual fidelity, and stabilize the multi-objective learning with an adaptive entropy regularization term. Extensive experiments on LlamaGen and VQGAN architectures demonstrate clear improvements across standard quality and diversity metrics within only a few hundred tuning iterations. The results also show that the model can be updated to produce competitive samples even without Classifier-Free Guidance, thereby avoiding its 2x inference cost.
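The abstract does not spell out how the LOO-FID reward is computed, so the following is only one plausible reading, sketched under stated assumptions: each sample is scored by how much the Fréchet distance to reference statistics rises when that sample is left out of the group, with the group's moments blended into an exponential moving average so that small groups still give usable statistics. All function names, the blending weight `beta`, and the choice to track raw second moments are hypothetical and not taken from the paper.

```python
import numpy as np

def frechet_distance(mu1, cov1, mu2, cov2):
    """Frechet distance between Gaussians N(mu1, cov1) and N(mu2, cov2)."""
    diff = mu1 - mu2
    # Tr((cov1 cov2)^(1/2)) from the eigenvalues of the product, which are
    # real and non-negative for PSD inputs up to numerical noise.
    eigvals = np.linalg.eigvals(cov1 @ cov2)
    tr_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    return float(diff @ diff + np.trace(cov1) + np.trace(cov2) - 2.0 * tr_sqrt)

def blended_fid(feats, ema_mu, ema_second, ref_mu, ref_cov, beta):
    """FID of EMA-blended generated moments against the reference moments."""
    mu_batch = feats.mean(axis=0)
    second_batch = feats.T @ feats / feats.shape[0]  # raw second moment
    mu = beta * ema_mu + (1.0 - beta) * mu_batch
    second = beta * ema_second + (1.0 - beta) * second_batch
    cov = second - np.outer(mu, mu)
    return frechet_distance(mu, cov, ref_mu, ref_cov)

def loo_fid_rewards(feats, ema_mu, ema_second, ref_mu, ref_cov, beta=0.9):
    """Reward_i = FID(group without sample i) - FID(full group).

    A positive reward means that including sample i lowers the FID.
    """
    full = blended_fid(feats, ema_mu, ema_second, ref_mu, ref_cov, beta)
    rewards = np.empty(feats.shape[0])
    for i in range(feats.shape[0]):
        rest = np.delete(feats, i, axis=0)
        rewards[i] = blended_fid(rest, ema_mu, ema_second,
                                 ref_mu, ref_cov, beta) - full
    return rewards

# Hypothetical usage with toy 4-dimensional features; index 0 is an outlier.
rng = np.random.default_rng(0)
dim = 4
toy_feats = rng.normal(size=(8, dim))
toy_feats[0] = 10.0
toy_rewards = loo_fid_rewards(toy_feats, np.zeros(dim), np.eye(dim),
                              np.zeros(dim), np.eye(dim))
```

In this toy setup the outlier at index 0 receives the lowest reward, since dropping it moves the blended moments closer to the reference. In practice the features would come from an Inception-style encoder, and the resulting per-sample rewards would be combined with the instance-level CLIP and HPSv2 scores before the policy update.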