VA-π: Variational Policy Alignment for Pixel-Aware Autoregressive Generation

📅 2025-12-22
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address pixel-space misalignment in autoregressive (AR) visual generation—caused by objective inconsistency between the tokenizer and the generator—this paper proposes VA-π, a lightweight post-training framework. Methodologically, it reformulates the AR generator as a policy optimized via reinforcement learning, using pixel-level reconstruction quality as an intrinsic reward; further, it derives a unified ELBO objective that jointly optimizes both pixel reconstruction fidelity and token distribution consistency. Grounded in variational inference and policy gradient methods, VA-π requires neither tokenizer retraining nor external reward models, relying solely on teacher-forcing-based reconstruction evaluation. On LlamaGen-XXL, VA-π achieves substantial gains with only 1% ImageNet data and 25 minutes of fine-tuning: FID improves from 14.36 to 7.65, IS rises from 86.55 to 116.70, and GenEval scores for text-to-image generation show marked enhancement.

📝 Abstract
Autoregressive (AR) visual generation relies on tokenizers to map images to and from discrete sequences. However, tokenizers are trained to reconstruct clean images from ground-truth tokens, while AR generators are optimized only for token likelihood. This misalignment means generated token sequences may decode into low-quality images, since the generator receives no direct supervision from pixel space. We propose VA-π, a lightweight post-training framework that directly optimizes AR models with a principled pixel-space objective. VA-π formulates generator-tokenizer alignment as a variational optimization, deriving an evidence lower bound (ELBO) that unifies pixel reconstruction and autoregressive modeling. To optimize over the discrete token space, VA-π introduces a reinforcement-based alignment strategy that treats the AR generator as a policy and uses pixel-space reconstruction quality as its intrinsic reward. The reward measures how well the predicted token sequences reconstruct the original image under teacher forcing, giving the model direct pixel-level guidance without expensive free-running sampling. The KL term of the ELBO serves as a natural regularizer, maintaining the distributional consistency of tokens. VA-π enables rapid adaptation of existing AR generators, requiring neither tokenizer retraining nor external reward models. With only 1% of ImageNet-1K data and 25 minutes of tuning, it reduces FID from 14.36 to 7.65 and improves IS from 86.55 to 116.70 on LlamaGen-XXL, while also yielding notable gains in text-to-image generation on GenEval for both a visual generation model (LlamaGen: from 0.306 to 0.339) and a unified multi-modal model (Janus-Pro: from 0.725 to 0.744). Code is available at https://github.com/Lil-Shake/VA-Pi.
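The reinforcement-based alignment described in the abstract might be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: the `generator`/`tokenizer` interfaces (`encode`, `decode`), the MSE-based pixel reward, and the mean-reward baseline are all assumptions standing in for the authors' actual choices.

```python
import torch
import torch.nn.functional as F

def va_pi_step(generator, tokenizer, images, reg_weight=0.1):
    """One hypothetical VA-pi alignment step (illustrative interfaces).

    Assumes: tokenizer.encode(images) -> (B, L) integer tokens,
             tokenizer.decode(tokens) -> (B, C, H, W) images,
             generator(tokens)        -> (B, L, V) next-token logits (teacher forcing).
    """
    with torch.no_grad():
        gt_tokens = tokenizer.encode(images)          # ground-truth token sequence

    logits = generator(gt_tokens)                     # teacher-forced logits
    dist = torch.distributions.Categorical(logits=logits)
    pred_tokens = dist.sample()                       # predicted token at each position

    with torch.no_grad():
        recon = tokenizer.decode(pred_tokens)         # decode predictions back to pixels
        # intrinsic reward: negative per-image pixel reconstruction error
        reward = -F.mse_loss(recon, images, reduction="none").mean(dim=(1, 2, 3))
        advantage = reward - reward.mean()            # simple mean baseline

    log_prob = dist.log_prob(pred_tokens).sum(dim=1)  # sequence log-probability
    policy_loss = -(advantage * log_prob).mean()      # REINFORCE on the pixel reward
    # token-distribution consistency (the ELBO's regularization term),
    # approximated here by the standard next-token cross-entropy
    nll = F.cross_entropy(logits.transpose(1, 2), gt_tokens)
    return policy_loss + reg_weight * nll
```

Because the reward is computed under teacher forcing (conditioning on ground-truth prefixes), no expensive free-running sampling loop is needed, which is consistent with the lightweight post-training setting described above.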

Problem

Research questions and friction points this paper is trying to address.

Autoregressive image generation suffers from misalignment between token likelihood and pixel quality.
Existing methods lack direct pixel-space supervision during autoregressive modeling.
Tokenizers and generators are optimized separately, without a unified training objective.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Lightweight post-training framework optimizes AR models with pixel-space objective.
Reinforcement-based alignment strategy uses pixel reconstruction as intrinsic reward.
Variational optimization unifies pixel reconstruction and autoregressive modeling via ELBO.
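The unified ELBO referenced above can be sketched as a standard variational decomposition; the paper's exact parameterization may differ, and the roles assigned to $q_\phi$, $p_\psi$, and $p_\theta$ here are an assumption:

$$\log p_\theta(x) \;\ge\; \mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\psi(x \mid z)\big] \;-\; \mathrm{KL}\big(q_\phi(z \mid x)\,\|\,p_\theta(z)\big),$$

where $x$ is an image, $q_\phi(z \mid x)$ is the (frozen) tokenizer encoder producing token sequence $z$, $p_\psi(x \mid z)$ is the tokenizer decoder, and $p_\theta(z)$ is the autoregressive generator acting as the prior. The first term yields the pixel-reconstruction reward, while the KL term keeps the generator's token distribution consistent, matching the two Innovation points above.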