🤖 AI Summary
Autoregressive text-to-image generation suffers from slow inference due to sequential, token-by-token decoding. Existing speculative decoding methods underperform in the image domain, primarily because of the high-dimensional sampling space, the difficulty of aligning draft and target model outputs, and insufficient modeling of 2D spatial structure and local dependencies. To address these challenges, we propose a spatially aware speculative decoding framework: (1) we introduce 2D positional encoding and local attention into the draft model to enhance its ability to capture image structure and improve output consistency; (2) we design a lightweight draft model that collaborates with a large autoregressive target model. Evaluated on multiple benchmarks, our method achieves a 1.71× inference speedup while preserving image fidelity and diversity, addressing the modeling bottlenecks that have limited speculative decoding for image generation.
📝 Abstract
Autoregressive (AR) image generation models are capable of producing high-fidelity images but often suffer from slow inference due to their inherently sequential, token-by-token decoding process. Speculative decoding, which employs a lightweight draft model to approximate the output of a larger AR model, has shown promise in accelerating text generation without compromising quality. However, its application to image generation remains largely underexplored. The challenges stem from a significantly larger sampling space, which complicates alignment between draft and target model outputs, and from insufficient use of the two-dimensional spatial structure inherent in images, which limits the modeling of local dependencies. To overcome these challenges, we introduce Hawk, a new approach that harnesses the spatial structure of images to guide the speculative model toward more accurate and efficient predictions. Experimental results on multiple text-to-image benchmarks demonstrate a 1.71× speedup over standard AR models, while preserving both image fidelity and diversity.
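For readers unfamiliar with the draft/verify loop the abstract builds on, the sketch below shows the *standard* speculative sampling acceptance rule (accept a draft token with probability min(1, p/q), otherwise resample from the residual distribution), which preserves the target model's output distribution. This is a generic, minimal illustration with toy probability arrays, not Hawk's spatially aware variant; all function and variable names here are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def speculative_accept(draft_tokens, q_probs, p_probs):
    """Verify K draft tokens against the target model's distributions.

    q_probs: K rows, the draft model's distribution at each drafted step.
    p_probs: K + 1 rows, the target model's distribution at each step plus
             the step after the last draft (for the "bonus" token).
    Returns the accepted prefix plus one resampled or bonus token.
    """
    out = []
    for k, tok in enumerate(draft_tokens):
        p, q = p_probs[k], q_probs[k]
        # Accept the draft token with probability min(1, p(tok) / q(tok)).
        if rng.random() < min(1.0, p[tok] / q[tok]):
            out.append(tok)
        else:
            # Rejected: resample from the residual max(0, p - q), renormalized,
            # which keeps the overall samples distributed exactly as p.
            residual = np.maximum(p - q, 0.0)
            residual /= residual.sum()
            out.append(int(rng.choice(len(p), p=residual)))
            return out
    # All K drafts accepted: draw one extra token from the target's next step,
    # so each verification pass yields at least one and up to K + 1 tokens.
    out.append(int(rng.choice(len(p_probs[-1]), p=p_probs[-1])))
    return out
```

The speedup reported in the abstract comes from this structure: one target-model pass can validate up to K drafted tokens at once, and Hawk's 2D-aware draft model is designed to raise the acceptance rate on image token grids.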