Inference-Time Scaling for Visual AutoRegressive Modeling by Searching Representative Samples

πŸ“… 2026-01-12
πŸ›οΈ Chinese Conference on Pattern Recognition and Computer Vision
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
This work addresses the limitation imposed by discrete latent spaces on continuous path exploration during inference-time scaling of visual autoregressive models, which constrains generation quality. The authors propose VAR-Scaling, the first framework enabling inference-time scaling for such models, by mapping the discrete sampling space to an approximately continuous feature space via kernel density estimation. They further introduce a density-adaptive hybrid sampling strategy that combines Top-k and Random-k selection to optimize sample fidelity at critical scales. This approach overcomes the constraints of discrete representations, enabling efficient sampling navigation and significantly improving generation quality and fidelity in both class-conditional and text-to-image tasks. The method also reveals a hierarchical optimization mechanism governing both general and task-specific generation patterns.

πŸ“ Abstract
While inference-time scaling has significantly enhanced generative quality in large language and diffusion models, its application to vector-quantized (VQ) visual autoregressive modeling (VAR) remains unexplored. We introduce VAR-Scaling, the first general framework for inference-time scaling in VAR, addressing the critical challenge of discrete latent spaces that prohibit continuous path search. We find that VAR scales exhibit two distinct pattern types: general patterns and specific patterns, where later-stage specific patterns conditionally optimize early-stage general patterns. To overcome the discrete latent space barrier in VQ models, we map sampling spaces to quasi-continuous feature spaces via kernel density estimation (KDE), where high-density samples approximate stable, high-quality solutions. This transformation enables effective navigation of sampling distributions. We propose a density-adaptive hybrid sampling strategy: Top-k sampling focuses on high-density regions to preserve quality near distribution modes, while Random-k sampling explores low-density areas to maintain diversity and prevent premature convergence. Consequently, VAR-Scaling optimizes sample fidelity at critical scales to enhance output quality. Experiments on class-conditional and text-to-image evaluations demonstrate significant improvements in the inference process. The code is available at https://github.com/WD7ang/VAR-Scaling.
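The density-adaptive hybrid sampling described in the abstract can be sketched as follows. This is an illustrative reconstruction, not the authors' implementation: the function names (`kde_density`, `hybrid_select`), the Gaussian kernel, and the fixed bandwidth are all assumptions made for the sketch; the paper's actual feature extraction and scale-dependent scheduling are not reproduced here.

```python
import numpy as np

def kde_density(features, bandwidth=1.0):
    # Gaussian-kernel density estimate of each candidate against all candidates.
    # features: (n, d) array of candidate feature vectors.
    diffs = features[:, None, :] - features[None, :, :]   # (n, n, d) pairwise differences
    sq_dists = np.sum(diffs ** 2, axis=-1)                # (n, n) squared distances
    kernels = np.exp(-sq_dists / (2 * bandwidth ** 2))
    return kernels.mean(axis=1)                           # (n,) density score per candidate

def hybrid_select(features, top_k, random_k, bandwidth=1.0, rng=None):
    # Density-adaptive hybrid selection: keep the top_k highest-density
    # candidates (quality near distribution modes) plus random_k drawn from
    # the remaining low-density candidates (diversity, avoiding premature
    # convergence). Returns the selected candidate indices.
    rng = np.random.default_rng() if rng is None else rng
    density = kde_density(features, bandwidth)
    order = np.argsort(density)[::-1]                     # descending density
    top = order[:top_k]
    rest = order[top_k:]
    rand = rng.choice(rest, size=min(random_k, rest.size), replace=False)
    return np.concatenate([top, rand])
```

In an inference-time scaling loop, `hybrid_select` would be applied at each critical scale to prune the pool of candidate token samples before continuing generation.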
Problem

Research questions and friction points this paper is trying to address.

inference-time scaling
visual autoregressive modeling
discrete latent space
vector-quantized models
generative quality
Innovation

Methods, ideas, or system contributions that make the work stand out.

inference-time scaling
visual autoregressive modeling
discrete latent space
kernel density estimation
hybrid sampling
Weidong Tang
School of Electronic Engineering, Xidian University, Xi’an, China
Xinyan Wan
School of Electronic Engineering, Xidian University, Xi’an, China
Siyu Li
University of Illinois at Chicago
Robotics · Micro-robot swarms · Human-robot Interaction · Control and Motion Planning
Xiumei Wang
Xidian University
machine learning · image processing