CASCADE: Context-Aware Relaxation for Speculative Image Decoding

📅 2026-05-08

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the inefficiency of autoregressive image generation, which is computationally expensive and slow. Existing speculative decoding methods yield limited gains in this setting because the target model's high uncertainty causes many draft tokens to be rejected. The authors analyze redundancy in the target model's hidden states under tree-structured speculative decoding and formalize two properties: semantic interchangeability and convergence. Building on these properties, they introduce a context-aware relaxation of token acceptance that raises acceptance rates without additional training. They also inject redundancy signals from the target model into drafter training to improve the drafter's standalone performance. The approach achieves up to 3.6x speedup across multiple text-to-image models and drafter architectures while maintaining image quality and text-prompt fidelity.

📝 Abstract

Autoregressive generation is a powerful approach for high-fidelity image synthesis, but it remains computationally demanding and slow even on the most advanced accelerators. While speculative decoding has been explored to mitigate this bottleneck, existing approaches fail to achieve efficiency gains comparable to those observed in text generation. A key limitation is the target model's high uncertainty during image generation, which leads to high draft token rejection rates. In this work, we identify previously overlooked patterns in the target model's behavior that emerge naturally in tree-based speculative decoding. Specifically, we formalize two properties, semantic interchangeability and convergence, arising from the redundancies in the target model's hidden state representations. By capturing these redundancies across the depth and breadth of the predicted token tree, our method identifies principled opportunities for acceptance relaxation without requiring additional training. Additionally, we enhance standalone drafter performance by injecting the redundancy signals from the target model into drafter training with minimal modification. We evaluate our approach across multiple text-to-image models and drafter architectures. Results show that CASCADE achieves state-of-the-art speedups for drafter-based speculative decoding, with up to 3.6x acceleration, while maintaining image quality and text-prompt fidelity.
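
To make the rejection problem concrete, the sketch below contrasts the standard speculative sampling acceptance test (accept a drafted token with probability min(1, p/q)) with one hypothetical form of relaxation. It is an illustration only, not CASCADE's actual rule: the function names, the toy distributions, and the idea of pooling target probability over a hand-picked set of "interchangeable" tokens are all assumptions made here, since the abstract does not specify how interchangeability is detected or applied.

```python
def standard_accept_prob(p_target, q_draft, token):
    """Standard speculative sampling: accept with probability min(1, p/q)."""
    return min(1.0, p_target.get(token, 0.0) / q_draft[token])


def relaxed_accept_prob(p_target, q_draft, token, interchangeable):
    """Hypothetical relaxation (an assumption, not the paper's rule):
    pool the target probability of tokens treated as interchangeable
    with the drafted token before applying the min(1, p/q) test."""
    group = set(interchangeable) | {token}
    pooled = sum(p_target.get(t, 0.0) for t in group)
    return min(1.0, pooled / q_draft[token])


# Toy distributions: the target is uncertain (mass spread over several
# tokens), while the drafter is confident in token "a".
p_target = {"a": 0.2, "b": 0.2, "c": 0.2, "d": 0.4}
q_draft = {"a": 0.8, "d": 0.2}

strict = standard_accept_prob(p_target, q_draft, "a")
relaxed = relaxed_accept_prob(p_target, q_draft, "a", interchangeable={"b", "c"})
print(f"strict acceptance:  {strict:.2f}")
print(f"relaxed acceptance: {relaxed:.2f}")
```

With a flat target distribution the strict test rejects the drafted token most of the time, which is the behavior the abstract attributes to image generation. The standard test preserves the target model's output distribution exactly; any relaxation of this kind gives up that guarantee in exchange for a higher acceptance rate, which is why the paper reports image quality and prompt fidelity alongside speedup.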

Problem

Research questions and friction points this paper is trying to address.

speculative decoding

autoregressive image generation

token rejection

computational efficiency

image synthesis

Innovation

Methods, ideas, or system contributions that make the work stand out.

speculative decoding

context-aware relaxation

semantic interchangeability