KERV: Kinematic-Rectified Speculative Decoding for Embodied VLA Models

📅 2026-03-02
🤖 AI Summary
This work addresses the slow inference speed of current Vision-Language-Action (VLA) models and the challenges associated with speculative decoding, particularly high re-execution overhead and difficulty in tuning acceptance thresholds. To overcome these limitations, the study introduces robot kinematics into the speculative decoding process for the first time, proposing a kinematics-informed speculative decoding framework. This framework leverages kinematics-driven Kalman filtering to predict actions and compensate for token-level errors, while also incorporating a dynamic threshold adjustment strategy to enable an adaptive acceptance mechanism. Evaluated across diverse tasks and environments, the method achieves inference speedups of 27% to 37% with negligible degradation in task success rates.
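The kinematics-driven prediction step can be illustrated with a small filter. The following is a minimal sketch, not the authors' implementation: it assumes a constant-velocity motion model applied independently to each action dimension, and the class `ConstantVelocityKF`, the helper `predict_next_action`, and the noise values `q` and `r` are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass


@dataclass
class ConstantVelocityKF:
    """1-D constant-velocity Kalman filter for one action dimension (assumed model)."""
    dt: float = 1.0    # assumed control period
    q: float = 1e-3    # assumed process-noise variance (added to both diagonal terms)
    r: float = 1e-2    # assumed measurement-noise variance
    x: float = 0.0     # position estimate
    v: float = 0.0     # velocity estimate
    pxx: float = 1.0   # covariance P = [[pxx, pxv], [pxv, pvv]]
    pxv: float = 0.0
    pvv: float = 1.0

    def predict(self) -> float:
        """Propagate the state one step: x <- x + v*dt, P <- F P F^T + Q."""
        dt = self.dt
        self.x += self.v * dt
        pxx = self.pxx + 2.0 * dt * self.pxv + dt * dt * self.pvv + self.q
        pxv = self.pxv + dt * self.pvv
        pvv = self.pvv + self.q
        self.pxx, self.pxv, self.pvv = pxx, pxv, pvv
        return self.x

    def update(self, z: float) -> None:
        """Correct the state with an observed position z (measurement matrix H = [1, 0])."""
        s = self.pxx + self.r              # innovation variance
        kx, kv = self.pxx / s, self.pxv / s  # Kalman gain
        resid = z - self.x
        self.x += kx * resid
        self.v += kv * resid
        pxx = (1.0 - kx) * self.pxx
        pxv = (1.0 - kx) * self.pxv
        pvv = self.pvv - kv * self.pxv
        self.pxx, self.pxv, self.pvv = pxx, pxv, pvv


def predict_next_action(history, dt=1.0):
    """Filter each action dimension over `history` and extrapolate one step ahead."""
    dims = len(history[0])
    filters = [ConstantVelocityKF(dt=dt, x=history[0][i]) for i in range(dims)]
    for action in history[1:]:
        for kf, z in zip(filters, action):
            kf.predict()
            kf.update(z)
    return [kf.predict() for kf in filters]
```

Given a list of past action vectors, `predict_next_action(history)` returns a one-step-ahead estimate that could stand in for a rejected draft action. A real robot would likely need a richer kinematic state (for example joint angles and velocities) than this per-dimension model.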

📝 Abstract
Vision-Language-Action (VLA) models establish a token-domain paradigm for robot control, but their inference is slow. Speculative Decoding (SD) is an optimization strategy that can speed up inference. Integrating SD into VLA models raises two key issues: first, SD relies on re-inference to correct token errors, which is computationally expensive; second, the SD acceptance threshold requires careful tuning to mitigate token errors. Existing work addresses neither issue effectively. Meanwhile, although embodied intelligence serves as the bridge between AI and the physical world, existing approaches have overlooked robot kinematics. To address these issues, we combine token-domain VLA models with kinematic-domain prediction for SD and propose a kinematic-rectified SD framework named KERV. We employ a kinematics-based Kalman Filter to predict actions and compensate for SD errors, avoiding costly re-inference. Moreover, we design a kinematics-based adjustment strategy that dynamically rectifies the acceptance threshold, which addresses the difficulty of choosing the threshold. Experimental results across diverse tasks and environments show that KERV achieves a 27% to 37% speedup with almost no loss in success rate.
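The interaction between the acceptance threshold and kinematic compensation can be sketched in a few lines. This is an illustration under stated assumptions, not the algorithm from the paper: it assumes actions are discretized into integer bins, that a draft token is accepted when it lies within `threshold` bins of the verifier's token, that positions after the first rejection are filled from a kinematic prediction instead of a second model pass, and that the threshold moves by one bin per step. The function names, the tolerance `tol`, and the bounds `lo` and `hi` are hypothetical.

```python
def verify_with_compensation(draft, target, kinematic, threshold):
    """Accept the longest draft prefix within `threshold` bins of the target tokens.

    Positions after the first rejection are filled from `kinematic` (for example,
    a Kalman-filter prediction mapped to bins) rather than by re-running the model.
    Returns the final token list and the number of accepted draft tokens.
    """
    accepted = 0
    for d, t in zip(draft, target):
        if abs(d - t) > threshold:
            break
        accepted += 1
    return list(draft[:accepted]) + list(kinematic[accepted:]), accepted


def adjust_threshold(threshold, kinematic_error, tol, lo=0, hi=8):
    """Tighten the threshold when motion deviates from the kinematic prediction,
    and relax it otherwise (assumed one-bin step, clamped to [lo, hi])."""
    if kinematic_error > tol:
        return max(lo, threshold - 1)
    return min(hi, threshold + 1)
```

For example, with `draft=[10, 20, 35, 40]`, `target=[11, 20, 30, 41]`, `kinematic=[10, 21, 31, 40]`, and `threshold=1`, the first two draft tokens are accepted and the last two positions come from the kinematic prediction. A looser threshold accepts more draft tokens per step at the cost of larger action error, which is why the sketch ties the threshold to a kinematic error signal.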
Problem

Research questions and friction points this paper is trying to address.

Speculative Decoding
Vision-Language-Action models
token errors
acceptance threshold
robotic kinematics
Innovation

Methods, ideas, or system contributions that make the work stand out.

Speculative Decoding
Kinematic Rectification
Vision-Language-Action Models
Kalman Filter
Embodied Intelligence
Authors

Zihao Zheng
Peking University
Machine Learning System, Edge Computing, Computer Architecture, EDA

Zhihao Mao
School of Computer Science, China University of Geosciences, Wuhan, China

Maoliang Li
School of Computer Science, Peking University, Beijing, China

Jiayu Chen
School of Computer Science, Peking University, Beijing, China

Xinhao Sun
School of Electronics Engineering and Computer Science, Peking University, Beijing, China

Zhaobo Zhang
School of Computer Science, Peking University, Beijing, China

Donggang Cao
School of Computer Science, Peking University, Beijing, China

Hong Mei
Peking University
Software Engineering, System Software, Data Analytics

Xiang Chen
School of Computer Science, Peking University, Beijing, China