REAR: Rethinking Visual Autoregressive Models via Generator-Tokenizer Consistency Regularization

πŸ“… 2025-10-05
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
Visual autoregressive (AR) generative models underperform diffusion models primarily due to token-level inconsistency between the generator and the tokenizer: AR-generated tokens may not decode well. This work identifies this token-level misalignment as the core bottleneck. The proposed consistency regularization requires no architectural modification: within a causal Transformer, the model jointly reconstructs the visual embeddings of the current and target tokens alongside standard next-token prediction, and injects noise into contextual representations to enhance robustness. The method substantially improves generation quality on ImageNet, reducing gFID from 3.02 to 1.86 with a standard rasterization-based tokenizer (IS 316.9) and to 1.42 with an advanced tokenizer. Remarkably, a 177M-parameter AR model matches the performance of a 675M-parameter state-of-the-art diffusion model, empirically validating the effectiveness and efficiency of co-optimizing generation and tokenization.

πŸ“ Abstract
Visual autoregressive (AR) generation offers a promising path toward unifying vision and language models, yet its performance remains suboptimal against diffusion models. Prior work often attributes this gap to tokenizer limitations and rasterization ordering. In this work, we identify a core bottleneck from the perspective of generator-tokenizer inconsistency, i.e., the AR-generated tokens may not be well-decoded by the tokenizer. To address this, we propose reAR, a simple training strategy introducing a token-wise regularization objective: when predicting the next token, the causal transformer is also trained to recover the visual embedding of the current token and predict the embedding of the target token under a noisy context. It requires no changes to the tokenizer, generation order, inference pipeline, or external models. Despite its simplicity, reAR substantially improves performance. On ImageNet, it reduces gFID from 3.02 to 1.86 and improves IS to 316.9 using a standard rasterization-based tokenizer. When applied to advanced tokenizers, it achieves a gFID of 1.42 with only 177M parameters, matching the performance with larger state-of-the-art diffusion models (675M).
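The abstract's token-wise regularization can be summarized as three terms: standard next-token cross-entropy, reconstruction of the current token's visual embedding, and prediction of the target token's embedding from a noise-perturbed context. A minimal numpy sketch of how such a combined objective might look is below; the function name, loss weights, the identity "embedding heads", and the noise scale are all illustrative assumptions, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def rear_loss_sketch(hidden, cur_emb, tgt_emb, logits, target_ids,
                     w_cur=1.0, w_tgt=1.0, noise_std=0.1):
    """Toy sketch of a reAR-style objective (weights/heads are hypothetical).

    hidden:     (T, D) causal-transformer hidden states
    cur_emb:    (T, D) tokenizer embeddings of the current tokens
    tgt_emb:    (T, D) tokenizer embeddings of the target (next) tokens
    logits:     (T, V) next-token logits
    target_ids: (T,)   ground-truth next-token indices
    """
    # 1) Standard next-token prediction: cross-entropy over the vocabulary.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    l_ntp = -log_probs[np.arange(len(target_ids)), target_ids].mean()

    # 2) Recover the current token's visual embedding from the hidden state
    #    (an identity "head" stands in for a learned projection here).
    l_cur = ((hidden - cur_emb) ** 2).mean()

    # 3) Predict the target token's embedding under a noisy context:
    #    perturb the hidden state before matching the target embedding.
    noisy_hidden = hidden + noise_std * rng.standard_normal(hidden.shape)
    l_tgt = ((noisy_hidden - tgt_emb) ** 2).mean()

    return l_ntp + w_cur * l_cur + w_tgt * l_tgt
```

Note that nothing here touches the tokenizer or the inference pipeline: the extra terms only add auxiliary targets during training, which matches the paper's claim that no architectural or decoding changes are needed.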
Problem

Research questions and friction points this paper is trying to address.

Visual AR generation lags behind diffusion models, a gap prior work attributes to tokenizer limitations and rasterization ordering
Tokens produced by the AR generator may not be well decoded by the tokenizer (generator-tokenizer inconsistency)
Can this inconsistency be closed without modifying the tokenizer, generation order, or inference pipeline, and can a small AR model then match larger diffusion models?
Innovation

Methods, ideas, or system contributions that make the work stand out.

Regularizes generator-tokenizer consistency in autoregressive models
Uses a token-wise objective that recovers the current token's visual embedding and predicts the target token's embedding under a noisy context
Requires no changes to the tokenizer, generation order, inference pipeline, or external models