HeiSD: Hybrid Speculative Decoding for Embodied Vision-Language-Action Models with Kinematic Awareness

📅 2026-03-18
🤖 AI Summary
This work addresses the inefficiency of existing Vision-Language-Action (VLA) models during inference and the limited integration of drafter-based and retrieval-based speculative decoding strategies. To overcome these limitations, we propose HeiSD, a novel framework that dynamically blends both speculative decoding paradigms within VLA for the first time. HeiSD introduces a verification-skipping mechanism, a sequence-level relaxed acceptance policy, and a kinematics-aware fusion metric coupled with a hybrid boundary determination strategy. Experimental results demonstrate that HeiSD achieves up to a 2.45× speedup in simulation and a 2.06–2.41× speedup in real-world settings, while maintaining high task success rates.

📝 Abstract
Vision-Language-Action (VLA) models have become the mainstream solution for robot control, but they suffer from slow inference. Speculative Decoding (SD) is a promising acceleration method that falls into two categories: drafter-based SD and retrieval-based SD. Existing methods do not analyze the advantages and disadvantages of these two types of SD in VLA models, so each type has been applied or optimized only in isolation. In this paper, we analyze the trajectory patterns of robots controlled by a VLA model and derive a key insight: the two types of SD should be used in a hybrid manner. However, achieving hybrid SD in VLA models poses two challenges: (1) draft rejection and persistent errors in retrieval-based SD; (2) difficulty in determining the hybrid boundary. To address these, we propose the HeiSD framework. HeiSD includes a retrieval-based SD optimization method, which comprises a verify-skip mechanism and a sequence-wise relaxed acceptance strategy. Moreover, HeiSD uses a kinematic-based fused metric to automatically determine the hybrid boundary. Experimental results demonstrate that HeiSD attains a speedup of up to 2.45× in simulation benchmarks and 2.06×–2.41× in real-world scenarios, while sustaining a high task success rate.
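To make the terms in the abstract concrete, the following is a minimal Python sketch of the general ideas, not the HeiSD implementation. It assumes action tokens are integer bin indices, and it contrasts strict token-wise verification with one possible sequence-wise relaxed rule. The function names, the `tol` and `budget` parameters, and the thresholding direction in `choose_sd_mode` are all hypothetical; the abstract does not specify how the acceptance policy or the kinematic metric is defined.

```python
from typing import Sequence


def accepted_prefix_len(draft: Sequence[int], target: Sequence[int], tol: int = 0) -> int:
    """Token-wise verification: length of the longest draft prefix in which
    every token lies within `tol` bins of the target model's token.
    tol=0 corresponds to strict (exact-match) acceptance."""
    n = 0
    for d, t in zip(draft, target):
        if abs(d - t) > tol:
            break
        n += 1
    return n


def sequence_relaxed_accept(draft: Sequence[int], target: Sequence[int], budget: int) -> bool:
    """Hypothetical sequence-wise rule: accept the whole draft if the total
    bin deviation over the sequence stays within `budget`."""
    if len(draft) != len(target):
        return False
    return sum(abs(d - t) for d, t in zip(draft, target)) <= budget


def choose_sd_mode(kinematic_score: float, boundary: float) -> str:
    """Hypothetical hybrid switch: pick retrieval-based SD below the boundary
    and drafter-based SD at or above it. The direction is an assumption."""
    return "retrieval" if kinematic_score < boundary else "drafter"


draft = [10, 12, 15, 20]
target = [10, 13, 15, 26]
strict_len = accepted_prefix_len(draft, target, tol=0)
relaxed_len = accepted_prefix_len(draft, target, tol=1)
whole_ok = sequence_relaxed_accept(draft, target, budget=7)
print(strict_len, relaxed_len, whole_ok, choose_sd_mode(0.2, 0.5))
```

In this toy example, strict matching stops at the first mismatch, while the relaxed variants tolerate small bin deviations. That is the kind of trade-off between speed and action precision that a relaxed acceptance policy has to manage.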
Problem

Research questions and friction points this paper is trying to address.

Speculative Decoding
Vision-Language-Action Models
Hybrid Decoding
Inference Acceleration
Kinematic Awareness
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hybrid Speculative Decoding
Vision-Language-Action Models
Kinematic Awareness
Retrieval-based Speculative Decoding
Inference Acceleration
🔎 Similar Papers
Zihao Zheng
Peking University
Machine Learning System, Edge Computing, Computer Architecture, EDA
Zhihao Mao
School of Computer Science, China University of Geosciences, Wuhan, China
Sicheng Tian
School of Artificial Intelligence, Beijing Normal University, Beijing, China
Maoliang Li
School of Computer Science, Peking University, Beijing, China
Jiayu Chen
PhD student, IFLab@PKU
Efficient Visual Generation, ML system
Xinhao Sun
School of EECS, Peking University, Beijing, China
Zhaobo Zhang
School of Computer Science, Peking University, Beijing, China
Xuanzhe Liu
Boya Distinguished Professor, Peking University, ACM Distinguished Scientist
Machine Learning System, Mobile Computing System, Serverless Computing
Donggang Cao
School of Computer Science, Peking University, Beijing, China
Hong Mei
Peking University
Software Engineering, System Software, Data Analytics
Xiang Chen
School of Computer Science, Peking University, Beijing, China