Progress by Pieces: Test-Time Scaling for Autoregressive Image Generation

📅 2025-11-26
🤖 AI Summary
Visual autoregressive (AR) models hold significant promise for text-to-image generation, yet existing test-time scaling strategies such as Best-of-N are hindered by raster-scan decoding's lack of a global layout blueprint: full-length computation is wasted on erroneous decoding paths, and scaling yields diminishing returns. To address this, the authors propose GridAR, a framework that (1) partitions the canvas into grids for block-wise progressive generation; (2) fixes viable partial candidates as anchors and reformulates the prompt with an inferred layout to supply global structural guidance; and (3) prunes infeasible candidates early to avoid spending computation on low-quality trajectories. With N=4, GridAR outperforms Best-of-N (N=8) by 14.4% on T2I-CompBench++ while reducing computational cost by 25.6%, and it improves semantic preservation by 13.9% on PIE-Bench for autoregressive image editing. The work is an early integration of structured layout modeling with test-time scaling in visual AR models, improving both generation efficiency and fidelity.

📝 Abstract
Recent visual autoregressive (AR) models have shown promising capabilities in text-to-image generation, operating in a manner similar to large language models. While test-time computation scaling has brought remarkable success in enabling reasoning-enhanced outputs for challenging natural language tasks, its adaptation to visual AR models remains unexplored and poses unique challenges. Naively applying test-time scaling strategies such as Best-of-N can be suboptimal: they consume full-length computation on erroneous generation trajectories, while the raster-scan decoding scheme lacks a blueprint of the entire canvas, limiting scaling benefits as only a few prompt-aligned candidates are generated. To address these, we introduce GridAR, a test-time scaling framework designed to elicit the best possible results from visual AR models. GridAR employs a grid-partitioned progressive generation scheme in which multiple partial candidates for the same position are generated within a canvas, infeasible ones are pruned early, and viable ones are fixed as anchors to guide subsequent decoding. Coupled with this, we present a layout-specified prompt reformulation strategy that inspects partial views to infer a feasible layout for satisfying the prompt. The reformulated prompt then guides subsequent image generation to mitigate the blueprint deficiency. Together, GridAR achieves higher-quality results under limited test-time scaling: with N=4, it even outperforms Best-of-N (N=8) by 14.4% on T2I-CompBench++ while reducing cost by 25.6%. It also generalizes to autoregressive image editing, showing comparable edit quality and a 13.9% gain in semantic preservation on PIE-Bench over larger-N baselines.
Problem

Research questions and friction points this paper is trying to address.

Optimizing test-time computation scaling for autoregressive image generation models
Addressing inefficient computation usage in raster-scan decoding schemes
Mitigating blueprint deficiency in visual autoregressive generation processes
Innovation

Methods, ideas, or system contributions that make the work stand out.

GridAR framework enables test-time scaling for visual AR models
Grid-partitioned progressive generation prunes infeasible candidates early
Layout-specified prompt reformulation guides image generation with inferred layout
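The generate-prune-anchor loop described above can be sketched in a few lines. This is an illustrative reconstruction, not the authors' implementation: `generate_cell` and `score_partial` are hypothetical callables standing in for the visual AR decoder and the candidate-quality verifier, and the grid size, candidate count `n`, and pruning policy are assumptions for the sketch.

```python
def gridar_sketch(generate_cell, score_partial, prompt, grid=(2, 2), n=4):
    """Grid-partitioned progressive generation with early pruning.

    For each grid cell, sample n partial candidates, score each
    resulting partial canvas, prune the infeasible ones, and fix the
    best survivor as an anchor that conditions subsequent cells.
    """
    anchors = []  # cells fixed so far (the growing canvas)
    for cell_idx in range(grid[0] * grid[1]):
        # Sample n candidate continuations for the current cell,
        # conditioned on the prompt and the anchored cells.
        candidates = [generate_cell(prompt, anchors, cell_idx) for _ in range(n)]
        # Early pruning: rank partial canvases and discard weak
        # trajectories before spending full-length computation on them.
        ranked = sorted(
            candidates,
            key=lambda c: score_partial(prompt, anchors + [c]),
            reverse=True,
        )
        # Anchor the best viable cell to guide subsequent decoding.
        anchors.append(ranked[0])
    return anchors
```

Compared with Best-of-N, which scores only complete images, pruning at the cell level is what lets a small budget (N=4) cover more useful trajectories than a larger unpruned one (N=8).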
👥 Authors
Joonhyung Park (KAIST)
Hyeongwon Jang (KAIST)
Joowon Kim (KAIST)
Eunho Yang (KAIST)
Machine Learning · Statistics