VPG: Visual Prefix Guidance for Autoregressive Image and Video Generation

📅 2026-05-28

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the loss of generation quality that exposure bias and prefix drift cause at inference time in autoregressive image and video generators. The authors propose Visual Prefix Guidance (VPG), a training-free inference-time guidance method that draws its guidance signal from the already-generated prefix itself rather than from external semantic conditions, and that requires no change to the training procedure. VPG contrasts the model's predictions under the generated prefix and under a corrupted prefix, then extrapolates the logits toward candidates that strengthen the posterior support of the generated prefix. The method is applied to three architectures: VAR (class-conditional image generation), Infinity (text-to-image generation), and InfinityStar (text-to-video generation). Without retraining the base model, VPG reduces FID on VAR by 0.36 on average and improves benchmark performance on both image and video generation.

📝 Abstract

Autoregressive image and video generators are trained with teacher-forced histories but must sample from their own generated prefixes at inference time, making them vulnerable to exposure bias and prefix drift. Existing remedies either modify training or apply sampling-time guidance aimed primarily at external semantic conditions, such as class labels or text prompts, rather than testing whether a next-step prediction provides strong posterior support for the generated prefix itself. We propose Visual Prefix Guidance (VPG), a training-free inference-time guidance method for autoregressive image and video generation. VPG improves next-step prediction by contrasting the model's output under the generated prefix with its output under a corrupted prefix, then extrapolating logits toward candidates that strengthen the posterior support of the generated prefix. Across class-conditional image generation with VAR, text-to-image generation with Infinity, and text-to-video generation with InfinityStar, VPG improves generation quality without retraining the base model, reducing FID on VAR by 0.36 on average and improving benchmark performance on both image and video generation.
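The contrast-and-extrapolate step described in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact update rule or the corruption scheme, so the linear extrapolation below (in the style of classifier-free guidance) is an assumption, and the names `prefix_guided_logits`, `scale`, `logits_clean`, and `logits_corrupt` are hypothetical. The two logit vectors stand in for the model's next-step outputs under the generated prefix and under a corrupted copy of it.

```python
import numpy as np


def prefix_guided_logits(logits_clean, logits_corrupt, scale=1.0):
    """Extrapolate next-step logits away from the corrupted-prefix prediction.

    ASSUMPTION: a linear, classifier-free-guidance-style rule; the paper's
    actual formula may differ. scale=0 returns the unguided logits.
    """
    logits_clean = np.asarray(logits_clean, dtype=np.float64)
    logits_corrupt = np.asarray(logits_corrupt, dtype=np.float64)
    return logits_clean + scale * (logits_clean - logits_corrupt)


def softmax(logits):
    """Numerically stable softmax over the last axis."""
    shifted = logits - np.max(logits, axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)


# Toy logits over three candidate tokens (made-up numbers, not model output).
clean = np.array([2.0, 1.0, 0.5])    # prediction under the generated prefix
corrupt = np.array([1.0, 1.0, 1.0])  # prediction under a corrupted prefix

guided = prefix_guided_logits(clean, corrupt, scale=1.5)
probs = softmax(guided)
```

In this toy case the first candidate is the only one whose logit drops when the prefix is corrupted, so the extrapolation raises its probability relative to unguided sampling. In a real sampler the two logit vectors would come from two forward passes of the same model at each generation step.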

Problem

Research questions and friction points this paper is trying to address.

exposure bias

prefix drift

autoregressive generation

image generation

video generation

Innovation

Methods, ideas, or system contributions that make the work stand out.

Visual Prefix Guidance

autoregressive generation

exposure bias

inference-time guidance

prefix drift

🔎 Similar Papers

Video In-context Learning

2024-07-10 | arXiv.org | Citations: 3