🤖 AI Summary
To address the high computational cost and inference latency of vision-language models (VLMs) arising from autoregressive decoding, this paper proposes an imitation learning–based speculative decoding framework. The method employs a lightweight draft model to generate candidate token sequences, which are then verified and corrected in parallel by the full VLM via non-autoregressive processing; crucially, deep-layer features from the full model guide iterative refinement of the draft model—without requiring additional training data. The core contribution lies in the tight integration of imitation learning with speculative decoding, enabling efficient token-level generation and correction. Experiments demonstrate that the approach retains over 98% of the original model's performance while accelerating inference by 1.55–1.85×, significantly reducing multimodal response latency.
📝 Abstract
Vision-Language Models (VLMs) have made significant strides in visual understanding and query response generation, but often face challenges of high computational cost and inference latency due to autoregressive decoding. In this work, we introduce an imitation-learning-based Self-Speculative Decoding (SSD) framework, named FastVLM, to address these limitations. Our approach employs a lightweight draft model that generates tokens autoregressively, while the full model verifies these tokens non-autoregressively. Accepted tokens proceed seamlessly, while rejected tokens are corrected by the full model and used to guide the draft model's refinement. Through an imitation network, FastVLM enhances the draft model by integrating deeper-level insights from the full model's architecture. Moreover, it preserves the performance of the full model while training the draft model, achieving a balance between efficiency and accuracy. Our method speeds up inference by 1.55–1.85× compared to autoregressive decoding with the final layer, with minimal loss in performance.
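The draft/verify/correct loop described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: `draft_next` and `full_next` are hypothetical stand-ins for the lightweight draft model and the full VLM (here simple greedy next-token callables over integer token IDs), and the full model's parallel verification pass is simulated position by position. The key property the sketch preserves is that the output always matches what the full model alone would have generated, regardless of draft quality.

```python
# Sketch of greedy speculative decoding with draft proposal, full-model
# verification, and correction of the first rejected token. Hypothetical
# toy setup; a real system verifies all candidate positions in ONE
# non-autoregressive forward pass of the full model.

def speculative_decode(prefix, draft_next, full_next, n_draft=4, n_tokens=8):
    """Generate `n_tokens` tokens after `prefix`.

    draft_next / full_next: callables mapping a token-ID context to the
    next greedy token for the draft and full models respectively.
    """
    out = list(prefix)
    while len(out) < len(prefix) + n_tokens:
        # 1) Draft model proposes a short candidate block autoregressively.
        cand, ctx = [], list(out)
        for _ in range(n_draft):
            t = draft_next(ctx)
            cand.append(t)
            ctx.append(t)
        # 2) Full model verifies the block (parallel in a real system).
        accepted = []
        for t in cand:
            target = full_next(out + accepted)
            if t == target:
                accepted.append(t)       # token accepted as-is
            else:
                accepted.append(target)  # rejected: keep the correction, stop
                break
        out.extend(accepted)
    return out[: len(prefix) + n_tokens]
```

Even with a poor draft the loop degrades gracefully: every iteration still commits at least one full-model token (the correction), so output quality is unchanged and only the speedup shrinks. The 1.55–1.85× gains reported above come from the common case where most drafted tokens are accepted, amortizing one full-model pass over several tokens.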