🤖 AI Summary
This work addresses the high latency of autoregressive image generation, which stems from sequential sampling, and the limitations of existing speculative decoding methods, which suffer from token-level ambiguity and lack spatial awareness. The authors propose MuLo-SD, a multi-scale local speculative decoding framework in which a low-resolution draft model, paired with learned up-samplers, proposes candidate image tokens that a high-resolution target model then verifies in parallel. A local rejection and resampling mechanism corrects draft errors within spatial neighborhoods instead of resampling in raster-scan order after the first rejection. On the MS-COCO 5k validation split, MuLo-SD achieves up to 1.7× speedup, outperforming baselines such as EAGLE-2 and LANTERN in acceleration while maintaining comparable semantic alignment and perceptual quality, as measured by GenEval, DPG-Bench, and FID/HPSv2.
📝 Abstract
Autoregressive (AR) models have achieved remarkable success in image synthesis, yet their sequential nature imposes significant latency constraints. Speculative decoding offers a promising avenue for acceleration, but existing approaches are limited by token-level ambiguity and a lack of spatial awareness. In this work, we introduce Multi-Scale Local Speculative Decoding (MuLo-SD), a novel framework that combines multi-resolution drafting with spatially informed verification to accelerate AR image generation. Our method leverages a low-resolution drafter paired with learned up-samplers to propose candidate image tokens, which are then verified in parallel by a high-resolution target model. Crucially, we incorporate a local rejection and resampling mechanism that corrects draft errors efficiently by focusing on spatial neighborhoods rather than resampling in raster-scan order after the first rejection. We demonstrate that MuLo-SD achieves substantial speedups of up to $\mathbf{1.7\times}$, outperforming strong speculative decoding baselines such as EAGLE-2 and LANTERN in terms of acceleration, while maintaining comparable semantic alignment and perceptual quality. These results are validated using GenEval, DPG-Bench, and FID/HPSv2 on the MS-COCO 5k validation split. Extensive ablations highlight the impact of up-sampling design, probability pooling, and local rejection and resampling with neighborhood expansion. Our approach sets a new state of the art in speculative decoding for image synthesis, bridging the gap between efficiency and fidelity.
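To make the contrast between local and raster-scan resampling concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not specify the acceptance criterion, the neighborhood shape, or how probability pooling enters. The sketch assumes the standard speculative-decoding test (accept a draft token when a uniform sample falls below min(1, p/q)), a square neighborhood of a given `radius`, and hypothetical helper names (`accept_mask`, `resample_mask`, `raster_resample_count`).

```python
from typing import List

BoolGrid = List[List[bool]]
FloatGrid = List[List[float]]


def accept_mask(p_target: FloatGrid, q_draft: FloatGrid, uniforms: FloatGrid) -> BoolGrid:
    """Per-position acceptance: accept when u < min(1, p / q).

    Assumption: the standard speculative-decoding test; q must be nonzero.
    """
    return [
        [u < min(1.0, p / q) for p, q, u in zip(p_row, q_row, u_row)]
        for p_row, q_row, u_row in zip(p_target, q_draft, uniforms)
    ]


def resample_mask(accepted: BoolGrid, radius: int = 1) -> BoolGrid:
    """Mark every position within `radius` of a rejected token for resampling.

    Assumption: a square (Chebyshev) neighborhood stands in for the paper's
    neighborhood expansion, whose exact shape the abstract does not give.
    """
    h, w = len(accepted), len(accepted[0])
    mask = [[False] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            if accepted[r][c]:
                continue
            for rr in range(max(0, r - radius), min(h, r + radius + 1)):
                for cc in range(max(0, c - radius), min(w, c + radius + 1)):
                    mask[rr][cc] = True
    return mask


def raster_resample_count(accepted: BoolGrid) -> int:
    """Tokens discarded when everything from the first rejection onward is redrawn."""
    flat = [cell for row in accepted for cell in row]
    return len(flat) - flat.index(False) if False in flat else 0


# Toy 4x4 token grid: the drafter and target agree everywhere except at (0, 1).
H = W = 4
p_target = [[0.5] * W for _ in range(H)]
q_draft = [[0.5] * W for _ in range(H)]
p_target[0][1], q_draft[0][1] = 0.1, 0.8  # target assigns low probability here
uniforms = [[0.5] * W for _ in range(H)]  # fixed samples keep the example deterministic

accepted = accept_mask(p_target, q_draft, uniforms)
local = resample_mask(accepted, radius=1)
local_count = sum(cell for row in local for cell in row)
raster_count = raster_resample_count(accepted)
```

In this toy grid a single early rejection touches only the tokens in its neighborhood under the local scheme, whereas raster-scan resampling discards nearly the whole grid. That gap is the intuition behind the efficiency claim; the actual savings depend on the models and on details the abstract does not state.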