🤖 AI Summary
Existing speculative decoding methods deliver much smaller speedups on multimodal large language models (MLLMs) than on text-only LLMs, primarily because tight coupling between text and visual tokens degrades draft quality. To address this, we propose Multimodal Speculative Decoding (MSD), a speculative decoding framework for MLLMs built on token-level modality decoupling: the draft model processes text and visual tokens separately. We further introduce a two-stage training strategy for the draft model: first strengthening language modeling ability on text-only instruction-tuning data, then gradually introducing multimodal data to build visual perception. On LLaVA-1.5-7B and LLaVA-1.5-13B, MSD achieves up to 2.29× and 2.46× inference speedup, respectively, outperforming prior MLLM speculative decoding approaches. MSD thereby addresses the mismatch between draft and target models in multimodal settings.
📝 Abstract
This paper introduces Multimodal Speculative Decoding (MSD) to accelerate inference in Multimodal Large Language Models (MLLMs). Speculative decoding has been shown to accelerate Large Language Models (LLMs) without sacrificing accuracy. However, current speculative decoding methods for MLLMs fail to achieve the same speedup as they do for LLMs. To address this, we reimagine speculative decoding specifically for MLLMs. Our analysis of MLLM characteristics reveals two key design principles for MSD: (1) Text and visual tokens have fundamentally different characteristics and need to be processed separately during drafting. (2) Both language modeling ability and visual perception capability are crucial for the draft model. For the first principle, MSD decouples text and visual tokens in the draft model, allowing each to be handled based on its own characteristics. For the second principle, MSD uses a two-stage training strategy: in stage one, the draft model is trained on text-only instruction-tuning datasets to improve its language modeling ability; in stage two, MSD gradually introduces multimodal data to enhance the visual perception capability of the draft model. Experiments show that MSD boosts inference speed by up to $2.29\times$ for LLaVA-1.5-7B and up to $2.46\times$ for LLaVA-1.5-13B on multimodal benchmarks, demonstrating its effectiveness. Our code is available at https://github.com/Lyn-Lucy/MSD.
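For readers unfamiliar with the draft-then-verify loop that speculative decoding relies on, the following is a minimal sketch of the generic greedy variant, not MSD itself and not taken from the linked repository. The function names (`speculative_decode`, `greedy_decode`, `target_next`, `draft_next`), the draft length `k`, and the toy integer "models" are all assumptions made for illustration. It shows why accuracy is preserved: every emitted token is one the target model would have chosen, and the draft only determines how many tokens are confirmed per verification round.

```python
from typing import Callable, List, Tuple

# A "model" here is a hypothetical stand-in: a function mapping a token
# context to its greedy next token.
NextToken = Callable[[List[int]], int]


def greedy_decode(target_next: NextToken, prompt: List[int], n: int) -> List[int]:
    """Baseline: one target-model call per generated token."""
    tokens = list(prompt)
    for _ in range(n):
        tokens.append(target_next(tokens))
    return tokens


def speculative_decode(
    target_next: NextToken,
    draft_next: NextToken,
    prompt: List[int],
    max_new_tokens: int,
    k: int = 4,
) -> Tuple[List[int], int]:
    """Greedy draft-then-verify decoding; returns (tokens, verification rounds)."""
    tokens = list(prompt)
    rounds = 0
    while len(tokens) - len(prompt) < max_new_tokens:
        # Draft phase: the cheap model proposes k tokens autoregressively.
        draft: List[int] = []
        for _ in range(k):
            draft.append(draft_next(tokens + draft))

        # Verify phase: a real system scores all k positions in one batched
        # target forward pass; this sketch emulates that with per-position calls.
        rounds += 1
        accepted: List[int] = []
        for proposed in draft:
            expected = target_next(tokens + accepted)
            if proposed == expected:
                accepted.append(proposed)
            else:
                # First mismatch: keep the target's token, discard the rest.
                accepted.append(expected)
                break
        else:
            # All k drafts accepted: the target pass also yields a bonus token.
            accepted.append(target_next(tokens + accepted))
        tokens.extend(accepted)
    return tokens[: len(prompt) + max_new_tokens], rounds


# Toy models (assumptions): the target counts upward modulo 10; the draft
# agrees with it except immediately after token 7.
def target_next(ctx: List[int]) -> int:
    return (ctx[-1] + 1) % 10


def draft_next(ctx: List[int]) -> int:
    return 0 if ctx[-1] == 7 else (ctx[-1] + 1) % 10


output, rounds = speculative_decode(target_next, draft_next, [0], max_new_tokens=12)
baseline = greedy_decode(target_next, [0], 12)
print(output == baseline, rounds)
```

In this sketch the speedup comes entirely from how many drafted tokens the target accepts per round, which is why draft quality is the quantity the abstract's two design principles aim to improve.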