Multi-Scale Local Speculative Decoding for Image Generation

📅 2026-01-08

🏛️ arXiv.org

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the high latency of autoregressive image generation, which stems from sequential token sampling, and the limitations of existing speculative decoding methods, which suffer from token-level ambiguity and lack spatial awareness. The authors propose MuLo-SD, a multi-scale local speculative decoding framework in which a low-resolution draft model, paired with learned up-samplers, proposes candidate image tokens that a high-resolution target model then verifies in parallel. A local rejection and resampling mechanism corrects draft errors within spatial neighborhoods instead of resampling everything in raster-scan order after the first rejection. MuLo-SD achieves up to 1.7× speedup, outperforming speculative decoding baselines such as EAGLE-2 and LANTERN in acceleration, while maintaining comparable semantic alignment and perceptual quality, as evaluated with GenEval, DPG-Bench, and FID/HPSv2 on the MS-COCO 5k validation split.

📝 Abstract

Autoregressive (AR) models have achieved remarkable success in image synthesis, yet their sequential nature imposes significant latency constraints. Speculative decoding offers a promising avenue for acceleration, but existing approaches are limited by token-level ambiguity and lack of spatial awareness. In this work, we introduce Multi-Scale Local Speculative Decoding (MuLo-SD), a novel framework that combines multi-resolution drafting with spatially informed verification to accelerate AR image generation. Our method leverages a low-resolution drafter paired with learned up-samplers to propose candidate image tokens, which are then verified in parallel by a high-resolution target model. Crucially, we incorporate a local rejection and resampling mechanism, enabling efficient correction of draft errors by focusing on spatial neighborhoods rather than raster-scan resampling after the first rejection. We demonstrate that MuLo-SD achieves substantial speedups, up to $\mathbf{1.7\times}$, outperforming strong speculative decoding baselines such as EAGLE-2 and LANTERN in terms of acceleration, while maintaining comparable semantic alignment and perceptual quality. These results are validated using GenEval, DPG-Bench, and FID/HPSv2 on the MS-COCO 5k validation split. Extensive ablations highlight the impact of up-sampling design, probability pooling, and local rejection and resampling with neighborhood expansion. Our approach sets a new state-of-the-art in speculative decoding for image synthesis, bridging the gap between efficiency and fidelity.
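
To make the contrast between local and raster-scan resampling concrete, the following is a minimal sketch and not the authors' implementation. It assumes the standard speculative sampling acceptance test (accept a drafted token with probability min(1, p/q)), a square neighborhood of a fixed radius, and NumPy arrays in place of real model outputs. The function names `verify_draft`, `expand_neighborhood`, and `raster_resample_mask`, and the `radius` parameter, are hypothetical and chosen only for illustration.

```python
import numpy as np


def verify_draft(p_target, q_draft, draft_tokens, rng=None):
    """Return a boolean (H, W) mask of rejected draft tokens.

    Assumption: the usual speculative sampling test, accepting each drafted
    token with probability min(1, p/q). The paper's exact rule may differ.
    p_target, q_draft: (H, W, V) probabilities; draft_tokens: (H, W) ints.
    """
    rng = np.random.default_rng() if rng is None else rng
    rows, cols = np.indices(draft_tokens.shape)
    p = p_target[rows, cols, draft_tokens]
    q = q_draft[rows, cols, draft_tokens]
    accept_prob = np.minimum(1.0, p / np.maximum(q, 1e-12))
    return rng.random(draft_tokens.shape) >= accept_prob


def expand_neighborhood(rejected, radius=1):
    """Local scheme: mark every position within `radius` of a rejection."""
    h, w = rejected.shape
    padded = np.pad(rejected, radius, mode="constant", constant_values=False)
    expanded = np.zeros_like(rejected)
    # OR together all shifted copies of the mask (a square dilation).
    for dy in range(2 * radius + 1):
        for dx in range(2 * radius + 1):
            expanded |= padded[dy:dy + h, dx:dx + w]
    return expanded


def raster_resample_mask(rejected):
    """Raster-scan scheme: discard everything from the first rejection on."""
    flat = rejected.ravel()
    mask = np.zeros_like(flat)
    if flat.any():
        mask[np.argmax(flat):] = True
    return mask.reshape(rejected.shape)
```

In this toy setting, a single rejection near the top of an 8×8 token grid forces raster-scan resampling to redo almost the entire grid, while the local scheme with `radius=1` touches only a 3×3 patch around the rejected token.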

Problem

Research questions and friction points this paper is trying to address.

Autoregressive Image Generation

Speculative Decoding

Latency Reduction

Spatial Awareness

Token Ambiguity

Innovation

Methods, ideas, or system contributions that make the work stand out.

Speculative Decoding

Multi-Scale Generation

Local Resampling