🤖 AI Summary
To address the high computational cost and inference latency of vision-language models (VLMs) arising from autoregressive decoding, this paper proposes an imitation learning–based speculative decoding framework. The method employs a lightweight draft model to generate candidate token sequences, which are then verified and corrected in parallel by the full VLM via non-autoregressive processing; crucially, deep-layer features from the full model guide iterative refinement of the draft model—without requiring additional training data. The core contribution lies in the tight integration of imitation learning with speculative decoding, enabling efficient token-level generation and correction. Experiments demonstrate that the approach retains over 98% of the original model's performance while accelerating inference by 1.55–1.85×, significantly reducing multimodal response latency.
📝 Abstract
Vision-Language Models (VLMs) have made significant strides in visual understanding and query response generation, but often face challenges of high computational cost and inference latency due to autoregressive decoding. In this work, we introduce an imitation-learning-based Self-Speculative Decoding (SSD) framework, named FastVLM, to address these limitations. Our approach employs a lightweight draft model that generates tokens autoregressively, while the full model verifies these tokens non-autoregressively. Accepted tokens proceed seamlessly, while rejected tokens are corrected by the full model and used to guide the draft model's refinement. Through an imitation network, FastVLM enhances the draft model by integrating deeper-level insights from the full model's architecture. Moreover, it preserves the performance of the full model while training the draft model, achieving a balance between efficiency and accuracy. Our method speeds up inference by 1.55–1.85× compared to autoregressive decoding with the final layer, with minimal loss in performance.
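The draft/verify/correct loop described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: `draft_next` and `full_next` are hypothetical stand-ins for the lightweight draft model and the full VLM (here simple greedy next-token callables over integer token IDs), and the full model's parallel verification pass is simulated position by position. The key property the sketch preserves is that the output always matches what the full model alone would have generated, regardless of draft quality.

```python
# Sketch of greedy speculative decoding with draft proposal, full-model
# verification, and correction of the first rejected token. Hypothetical
# toy setup; a real system verifies all candidate positions in ONE
# non-autoregressive forward pass of the full model.

def speculative_decode(prefix, draft_next, full_next, n_draft=4, n_tokens=8):
    """Generate `n_tokens` tokens after `prefix`.

    draft_next / full_next: callables mapping a token-ID context to the
    next greedy token for the draft and full models respectively.
    """
    out = list(prefix)
    while len(out) < len(prefix) + n_tokens:
        # 1) Draft model proposes a short candidate block autoregressively.
        cand, ctx = [], list(out)
        for _ in range(n_draft):
            t = draft_next(ctx)
            cand.append(t)
            ctx.append(t)
        # 2) Full model verifies the block (parallel in a real system).
        accepted = []
        for t in cand:
            target = full_next(out + accepted)
            if t == target:
                accepted.append(t)       # token accepted as-is
            else:
                accepted.append(target)  # rejected: keep the correction, stop
                break
        out.extend(accepted)
    return out[: len(prefix) + n_tokens]
```

Even with a poor draft the loop degrades gracefully: every iteration still commits at least one full-model token (the correction), so output quality is unchanged and only the speedup shrinks. The 1.55–1.85× gains reported above come from the common case where most drafted tokens are accepted, amortizing one full-model pass over several tokens.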