🤖 AI Summary
Existing speculative decoding methods for large vision-language models (LVLMs) struggle to balance efficiency and robustness, often exhibiting unstable performance across diverse inputs. This work proposes a test-time adaptive batched ensemble draft generation mechanism, introducing dynamic ensemble strategies into LVLM speculative decoding for the first time. By leveraging parameter-sharing parallel draft generation, dynamically weighted fusion based on historical output deviation, and a plug-and-play verification architecture, the method significantly enhances both inference stability and speed without requiring additional training. Experiments demonstrate that the approach achieves an average 1.74× speedup over autoregressive decoding and improves performance by 5% compared to single-draft methods, with negligible ensemble overhead.
📝 Abstract
Speculative decoding (SD) has proven effective for accelerating LLM inference by quickly generating draft tokens and verifying them in parallel. However, SD remains largely unexplored for Large Vision-Language Models (LVLMs), which extend LLMs to process both image and text prompts. To address this gap, we benchmark existing inference methods with small draft models on 11 datasets across diverse input scenarios and observe scenario-specific performance fluctuations. Motivated by these findings, we propose Test-time Adaptive Batched Ensemble Drafting (TABED), which dynamically ensembles multiple drafts obtained via batch inference by leveraging deviations from past ground truths available in the SD setting. The dynamic ensemble method achieves a robust average wall-time speedup of 1.74× over autoregressive decoding and a 5% improvement over single-drafting methods, while remaining training-free and keeping ensembling costs negligible through parameter sharing. With its plug-and-play compatibility, we further enhance TABED by integrating advanced verification and alternative drafting methods. Code and custom-trained models are available at https://github.com/furiosa-ai/TABED.
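The core idea of the abstract can be illustrated with a minimal sketch: each drafter's historical deviation from the verifier's accepted (ground-truth) tokens is turned into a fusion weight, and the drafters' next-token distributions are combined by a weighted average. This is only an illustrative reconstruction, not the paper's actual implementation; the function names (`update_weights`, `fuse_drafts`), the softmax-over-negative-deviation weighting, and the averaging rule are all assumptions.

```python
import numpy as np


def update_weights(deviations, temperature=1.0):
    """Map per-drafter historical deviations (lower is better) to
    fusion weights via a softmax over negative deviation.
    Hypothetical weighting rule, not from the paper."""
    scores = -np.asarray(deviations, dtype=float) / temperature
    scores -= scores.max()  # numerical stability
    w = np.exp(scores)
    return w / w.sum()


def fuse_drafts(draft_probs, weights):
    """Weighted average of per-drafter next-token distributions.
    draft_probs: array of shape (num_drafters, vocab_size)."""
    return np.tensordot(weights, np.asarray(draft_probs, dtype=float), axes=1)


if __name__ == "__main__":
    # Drafter 0 has deviated less from past verified tokens,
    # so it receives the larger fusion weight.
    w = update_weights([0.1, 0.5])
    fused = fuse_drafts([[0.7, 0.3], [0.2, 0.8]], w)
    print(w, fused)
```

In an SD loop, `deviations` would be refreshed after every verification step (e.g., as a running average of disagreement with accepted tokens), so the ensemble adapts at test time without any training, matching the training-free property claimed above.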