RankE: End-to-End Post-Training for Discrete Text-to-Image Generation with Decoder Co-Evolution

2026-05-20

AI Summary
This work addresses a critical issue in discrete autoregressive text-to-image generation: when only the policy is optimized while the VQ decoder remains frozen, latent covariate shift can occur, leading to improved reward scores but degraded image quality. To resolve this, the paper introduces RankE, the first end-to-end post-training framework tailored for this task, which incorporates a decoder co-evolution mechanism. By alternately optimizing both the policy and the decoder, RankE breaks the trade-off between fidelity and alignment. The method integrates a ranking-based alignment objective with stability anchor regularization adapted to each module's parameter space, achieving simultaneous improvements in CLIP score and FID on LlamaGen-XL and Janus-Pro. These results indicate that decoder co-evolution translates reward optimization into gains in pixel-level image quality.
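One rough way to picture the latent covariate shift described above is to compare the token-usage histogram the decoder saw during training with the histogram the post-trained policy now emits. The sketch below is only an illustrative diagnostic under that assumption; the summary does not say how the paper quantifies the shift, and the function names, the smoothing constant, and the toy token sequences are all hypothetical.

```python
import math
from collections import Counter


def token_histogram(tokens, vocab_size, eps=1e-8):
    """Smoothed relative frequency of each codebook index (hypothetical helper)."""
    counts = Counter(tokens)
    total = len(tokens) + eps * vocab_size
    return [(counts.get(i, 0) + eps) / total for i in range(vocab_size)]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


# Toy data (assumed): the decoder was trained on evenly used codes,
# while the post-trained policy has collapsed onto code 0.
train_tokens = [0, 1, 2, 3, 0, 1, 2, 3]
policy_tokens = [0, 0, 0, 0, 0, 1, 0, 0]

p_train = token_histogram(train_tokens, vocab_size=4)
p_policy = token_histogram(policy_tokens, vocab_size=4)

# A larger value means the policy's tokens look less like the decoder's training data.
shift = kl_divergence(p_policy, p_train)
no_shift = kl_divergence(p_train, p_train)
```

A histogram only captures marginal token usage; a shift in how tokens co-occur spatially would need a richer statistic.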
Abstract
Discrete autoregressive (AR) text-to-image (T2I) models pair a VQ tokenizer with an AR policy, and current post-training pipelines optimize only the policy while keeping the VQ decoder frozen. Recent diffusion T2I work, exemplified by REPA-E, has shown that the VAE itself constitutes a key alignment bottleneck, yet no analogous investigation exists for discrete AR models. We show that policy-only optimization induces Latent Covariate Shift: as the policy evolves, the resulting token distribution diverges from the ground-truth distribution on which the decoder was trained, such that reward scores improve while decoded image quality degrades. To address this mismatch, we propose RankE, the first end-to-end post-training framework for discrete T2I generation. Rather than optimizing the policy against a fixed decoder, RankE co-evolves both components through alternating optimization: each module maximizes a ranking-based alignment objective while being regularized by a stability-preserving anchor suited to its parameter space. This co-evolution breaks the fidelity-alignment trade-off that plagues frozen-decoder approaches: on LlamaGen-XL (775M), standard RL improves CLIP but degrades FID, whereas RankE improves both simultaneously (FID 15.21, CLIP 33.76 on MS-COCO 30K). Consistent gains on Janus-Pro (1B) confirm that decoder co-evolution reliably converts reward optimization into pixel-space quality improvements.
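The alternating scheme in the abstract can be pictured with a toy sketch: two parameter blocks take turns minimizing a shared ranking loss, and each is pulled toward its starting point by an anchor term. Everything concrete here is an assumption made for illustration and is not taken from the paper: the logistic pairwise loss, the L2 anchor, the linear scoring model, the feature vectors, and the numeric-gradient update. The abstract only says the anchor is "suited to its parameter space", so the real regularizers may differ per module.

```python
import math


def pairwise_ranking_loss(scores, rewards):
    """Logistic loss over all pairs (i, j) where candidate i has the higher reward."""
    loss, pairs = 0.0, 0
    for i in range(len(scores)):
        for j in range(len(scores)):
            if rewards[i] > rewards[j]:
                loss += math.log1p(math.exp(-(scores[i] - scores[j])))
                pairs += 1
    return loss / max(pairs, 1)


def anchor_penalty(params, anchor):
    """Squared L2 distance to the pre-trained parameters (assumed anchor form)."""
    return sum((p - a) ** 2 for p, a in zip(params, anchor))


def numeric_grad_step(params, loss_fn, lr=0.1, h=1e-5):
    """One gradient-descent step using central finite differences."""
    grads = []
    for k in range(len(params)):
        up = params[:k] + [params[k] + h] + params[k + 1:]
        down = params[:k] + [params[k] - h] + params[k + 1:]
        grads.append((loss_fn(up) - loss_fn(down)) / (2 * h))
    return [p - lr * g for p, g in zip(params, grads)]


# Toy setup (hypothetical): three candidate images, each with a feature vector
# seen by the "policy" block and another seen by the "decoder" block.
FEATS_POLICY = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
FEATS_DECODER = [[0.2, 0.8], [0.9, 0.1], [0.5, 0.5]]
REWARDS = [1.0, 0.0, 0.5]
BETA = 0.1  # anchor strength (assumed)


def scores(policy, decoder):
    dot = lambda w, x: sum(wi * xi for wi, xi in zip(w, x))
    return [dot(policy, fp) + dot(decoder, fd)
            for fp, fd in zip(FEATS_POLICY, FEATS_DECODER)]


policy, decoder = [0.0, 0.0], [0.0, 0.0]
policy_anchor, decoder_anchor = list(policy), list(decoder)


def total_loss(policy, decoder):
    return (pairwise_ranking_loss(scores(policy, decoder), REWARDS)
            + BETA * anchor_penalty(policy, policy_anchor)
            + BETA * anchor_penalty(decoder, decoder_anchor))


initial_loss = total_loss(policy, decoder)
for step in range(50):
    if step % 2 == 0:   # policy turn: decoder held fixed
        policy = numeric_grad_step(policy, lambda p: total_loss(p, decoder))
    else:               # decoder turn: policy held fixed
        decoder = numeric_grad_step(decoder, lambda d: total_loss(policy, d))
final_loss = total_loss(policy, decoder)
final_scores = scores(policy, decoder)
```

In this toy, freezing the decoder would correspond to skipping the odd steps, so only the policy block could move the ranking. A real system would replace the finite-difference update with backpropagation and the linear scores with reward-model outputs on decoded images.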
Problem

Research questions and friction points this paper is trying to address.

Latent Covariate Shift
discrete text-to-image generation
post-training
VQ decoder
alignment bottleneck
Innovation

Methods, ideas, or system contributions that make the work stand out.

RankE
decoder co-evolution
latent covariate shift
end-to-end post-training
discrete text-to-image generation