🤖 AI Summary
To address the decoding inefficiency of large multimodal models (LMMs) caused by visual token redundancy, this paper proposes a speculative decoding framework tailored to LMMs. The method integrates visual token compression, latent-variable modeling, multimodal alignment optimization, and speculative decoding in a unified architecture. Key innovations include: (1) a lightweight latent-aware visual token compression mechanism that explicitly models redundancy in visual features; and (2) a semi-autoregressive multi-token parallel generation strategy that improves the draft model's acceptance rate and throughput. Evaluated on video captioning and visual instruction tuning tasks, the approach achieves 2.68× and 2.55× end-to-end speedups, respectively, substantially outperforming existing text-centric speculative decoding methods and establishing a new approach to efficient LMM inference.
📝 Abstract
Large language and multimodal models (LLMs and LMMs) exhibit strong inference capabilities but are often limited by slow decoding speeds. This challenge is especially acute in LMMs, where visual inputs typically comprise more tokens with lower information density than text, an issue exacerbated by recent trends toward finer-grained visual tokenizations to boost performance. Speculative decoding has been effective in accelerating LLM inference by using a smaller draft model to generate candidate tokens, which are then selectively verified by the target model, improving speed without sacrificing output quality. While this strategy has been extended to LMMs, existing methods largely overlook the unique properties of visual inputs and depend solely on text-based draft models. In this work, we propose **FLASH** (Fast Latent-Aware Semi-Autoregressive Heuristics), a speculative decoding framework designed specifically for LMMs, which leverages two key properties of multimodal data to design the draft model. First, to address redundancy in visual tokens, we propose a lightweight latent-aware token compression mechanism. Second, recognizing that visual objects often co-occur within a scene, we employ a semi-autoregressive decoding strategy to generate multiple tokens per forward pass. These innovations accelerate draft decoding while maintaining high acceptance rates, resulting in faster overall inference. Experiments show that FLASH significantly outperforms prior speculative decoding approaches in both unimodal and multimodal settings, achieving up to **2.68×** speed-up on video captioning and **2.55×** on visual instruction tuning tasks compared to the original LMM.
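To make the draft-then-verify loop described above concrete, here is a minimal greedy speculative decoding sketch with toy stand-in models. The functions `target_next` and `draft_next` are hypothetical placeholders, not the paper's FLASH draft model (which adds visual token compression and semi-autoregressive decoding); the sketch only illustrates the generic mechanism: the draft proposes `k` tokens, the target verifies them, the longest agreeing prefix is accepted, and the target supplies a correction token at the first mismatch.

```python
def target_next(ctx):
    """Toy deterministic target model: next token = (sum of context) % 10."""
    return sum(ctx) % 10

def draft_next(ctx):
    """Toy draft model that agrees with the target most of the time."""
    t = target_next(ctx)
    return t if t != 7 else 0  # disagrees whenever the target would emit 7

def speculative_decode(ctx, num_tokens, k=4):
    """Greedy speculative decoding: the draft proposes k candidate tokens;
    the target verifies them (one forward pass in a real system), accepting
    until the first disagreement and appending its own token there."""
    out = list(ctx)
    while len(out) - len(ctx) < num_tokens:
        # Draft proposes k tokens autoregressively (cheap model).
        proposal, d_ctx = [], list(out)
        for _ in range(k):
            t = draft_next(d_ctx)
            proposal.append(t)
            d_ctx.append(t)
        # Target verifies the proposals; accept the agreeing prefix.
        for t in proposal:
            expected = target_next(out)
            if t == expected:
                out.append(t)
            else:
                out.append(expected)  # correction token from the target
                break
            if len(out) - len(ctx) >= num_tokens:
                break
    return out[len(ctx):len(ctx) + num_tokens]

def autoregressive(ctx, num_tokens):
    """Plain target-only decoding, used as the correctness reference."""
    out = list(ctx)
    for _ in range(num_tokens):
        out.append(target_next(out))
    return out[len(ctx):]
```

With greedy acceptance, the speculative output is token-for-token identical to plain autoregressive decoding with the target model; the speed-up comes entirely from verifying several draft tokens per target pass, which is why raising the draft's acceptance rate (as FLASH does for multimodal inputs) directly improves throughput.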