VPG: Visual Prefix Guidance for Autoregressive Image and Video Generation

📅 2026-05-28
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the loss of generation quality that exposure bias and prefix drift cause at inference time in autoregressive image and video generators. The authors propose Visual Prefix Guidance (VPG), a training-free inference-time guidance method that uses the already-generated prefix as an internal signal, requiring neither external semantic conditions nor changes to the training procedure. VPG contrasts the model's prediction under the generated prefix with its prediction under a corrupted prefix, then extrapolates the logits toward candidates that strengthen the posterior support of the generated prefix. The method is applied to VAR (class-conditional image generation), Infinity (text-to-image generation), and InfinityStar (text-to-video generation). Experiments show improved results on both image and video generation benchmarks, with FID on VAR reduced by 0.36 on average.
📝 Abstract
Autoregressive image and video generators are trained with teacher-forced histories but must sample from their own generated prefixes at inference time, making them vulnerable to exposure bias and prefix drift. Existing remedies either modify training or apply sampling-time guidance aimed primarily at external semantic conditions, such as class labels or text prompts, rather than testing whether a next-step prediction provides strong posterior support for the generated prefix itself. We propose Visual Prefix Guidance (VPG), a training-free inference-time guidance method for autoregressive image and video generation. VPG improves next-step prediction by contrasting the model's output under the generated prefix with its output under a corrupted prefix, then extrapolating logits toward candidates that strengthen the posterior support of the generated prefix. Across class-conditional image generation with VAR, text-to-image generation with Infinity, and text-to-video generation with InfinityStar, VPG improves generation quality without retraining the base model, reducing FID on VAR by 0.36 on average and improving benchmark performance on both image and video generation.
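The contrast-and-extrapolate step described in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact update rule, the guidance scale, or the way the prefix is corrupted, so everything below is an assumption: the function name `prefix_guided_logits`, the `scale` parameter, and the linear extrapolation form (borrowed from classifier-free guidance) are hypothetical and are not taken from the paper.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def prefix_guided_logits(logits_prefix, logits_corrupted, scale):
    """Hypothetical guidance step (assumed form, not the paper's formula).

    logits_prefix:    next-step logits given the generated prefix
    logits_corrupted: next-step logits given a corrupted copy of the prefix
    scale:            assumed guidance strength; 0.0 leaves the logits unchanged
    """
    if len(logits_prefix) != len(logits_corrupted):
        raise ValueError("both logit vectors must cover the same vocabulary")
    # Push each logit away from the corrupted-prefix prediction, so candidates
    # that depend on the intact prefix gain probability mass.
    return [
        lp + scale * (lp - lc)
        for lp, lc in zip(logits_prefix, logits_corrupted)
    ]


# Toy three-token vocabulary: the corrupted prefix yields a flat prediction.
logits_prefix = [2.0, 1.0, 0.0]
logits_corrupted = [1.0, 1.0, 1.0]

guided = prefix_guided_logits(logits_prefix, logits_corrupted, scale=1.0)
probs = softmax(guided)
```

In this toy case the token that the intact prefix favors most gains probability after guidance, while a `scale` of 0.0 recovers ordinary sampling from the unguided logits. In a real generator the two logit vectors would come from two forward passes of the same model, one on the generated prefix and one on its corrupted copy.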
Problem

Research questions and friction points this paper is trying to address.

exposure bias
prefix drift
autoregressive generation
image generation
video generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Visual Prefix Guidance
autoregressive generation
exposure bias
inference-time guidance
prefix drift