🤖 AI Summary
Autoregressive text-to-image generation suffers from slow inference due to sequential, token-by-token decoding. Existing speculative decoding methods underperform in the image domain, primarily because of the high-dimensional sampling space, the difficulty of aligning draft and target model outputs, and insufficient modeling of 2D spatial structure and local dependencies. To address these challenges, we propose a spatially aware speculative decoding framework: (1) we introduce 2D positional encoding and local attention into the draft model to enhance its ability to capture image structure and improve output consistency; (2) we design a lightweight draft model that collaborates with a large autoregressive target model. Evaluated on multiple benchmarks, our method achieves a 1.71× inference speedup while preserving image fidelity and diversity, addressing the modeling bottlenecks that have limited speculative decoding for image generation.
📝 Abstract
Autoregressive (AR) image generation models are capable of producing high-fidelity images but often suffer from slow inference due to their inherently sequential, token-by-token decoding process. Speculative decoding, which employs a lightweight draft model to approximate the output of a larger AR model, has shown promise in accelerating text generation without compromising quality. However, its application to image generation remains largely underexplored. The challenges stem from a significantly larger sampling space, which complicates alignment between draft and target model outputs, and from insufficient use of the two-dimensional spatial structure inherent in images, which limits the modeling of local dependencies. To overcome these challenges, we introduce Hawk, a new approach that harnesses the spatial structure of images to guide the speculative model toward more accurate and efficient predictions. Experimental results on multiple text-to-image benchmarks demonstrate a 1.71× speedup over standard AR models, while preserving both image fidelity and diversity.
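For readers unfamiliar with the draft/verify loop the abstract builds on, the sketch below shows the *standard* speculative sampling acceptance rule (accept a draft token with probability min(1, p/q), otherwise resample from the residual distribution), which preserves the target model's output distribution. This is a generic, minimal illustration with toy probability arrays, not Hawk's spatially aware variant; all function and variable names here are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def speculative_accept(draft_tokens, q_probs, p_probs):
    """Verify K draft tokens against the target model's distributions.

    q_probs: K rows, the draft model's distribution at each drafted step.
    p_probs: K + 1 rows, the target model's distribution at each step plus
             the step after the last draft (for the "bonus" token).
    Returns the accepted prefix plus one resampled or bonus token.
    """
    out = []
    for k, tok in enumerate(draft_tokens):
        p, q = p_probs[k], q_probs[k]
        # Accept the draft token with probability min(1, p(tok) / q(tok)).
        if rng.random() < min(1.0, p[tok] / q[tok]):
            out.append(tok)
        else:
            # Rejected: resample from the residual max(0, p - q), renormalized,
            # which keeps the overall samples distributed exactly as p.
            residual = np.maximum(p - q, 0.0)
            residual /= residual.sum()
            out.append(int(rng.choice(len(p), p=residual)))
            return out
    # All K drafts accepted: draw one extra token from the target's next step,
    # so each verification pass yields at least one and up to K + 1 tokens.
    out.append(int(rng.choice(len(p_probs[-1]), p=p_probs[-1])))
    return out
```

The speedup reported in the abstract comes from this structure: one target-model pass can validate up to K drafted tokens at once, and Hawk's 2D-aware draft model is designed to raise the acceptance rate on image token grids.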