📝 Abstract
This note investigates core properties of martingales, emphasizing the measure-theoretic formulation of conditional expectation, the martingale transform, and the upcrossing lemma. These results lead to the Martingale Convergence Theorem, which we then apply to study the extinction behavior of Galton--Watson branching processes.