🤖 AI Summary
This work addresses the challenge of high computational overhead in Vision-Language-Action (VLA) models, which hinders their real-time deployment in robotic manipulation. To mitigate this, the authors propose a dynamic-static hierarchical layer-skipping mechanism that adaptively bypasses non-critical layers based on action importance. The approach is trained using a prior-posterior skipping guidance strategy and a skip-aware two-stage knowledge distillation framework. Experimental results demonstrate that the method substantially reduces computational cost while preserving task accuracy: on the CALVIN dataset, it achieves a 2.1% higher success length than Deer-VLA with 85.7× fewer trainable parameters, and delivers a 3.75× inference speedup over the RoboFlamingo baseline without compromising performance.
📝 Abstract
Vision-Language-Action (VLA) models have shown remarkable success in robotic tasks such as manipulation by fusing a language model's reasoning with a vision model's 3D understanding. However, their high computational cost remains a major obstacle for real-world applications that require real-time performance. We observe that the actions within a task have varying levels of importance: critical steps demand high precision, while less important ones can tolerate more variance. Leveraging this insight, we propose DySL-VLA, a novel framework that reduces computational cost by dynamically skipping VLA layers based on each action's importance. DySL-VLA categorizes its layers into two types: informative layers, which are always executed, and incremental layers, which can be selectively skipped. To skip layers intelligently without sacrificing accuracy, we introduce a prior-posterior skipping guidance mechanism that determines when to initiate layer-skipping. We also propose a skip-aware two-stage knowledge distillation algorithm to efficiently train a standard VLA into a DySL-VLA. Our experiments show that DySL-VLA achieves a 2.1% improvement in success length over Deer-VLA on the CALVIN dataset, while reducing trainable parameters by a factor of 85.7 and providing a 3.75× speedup relative to the RoboFlamingo baseline at iso-accuracy. Our code is available at https://github.com/PKU-SEC-Lab/DYSL_VLA.
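The core inference-time idea — always run informative layers, and bypass incremental layers when the current action is deemed unimportant — can be sketched in a few lines. This is a minimal illustrative sketch, not the paper's implementation: the layer structure, the scalar `importance` gate, and the `threshold` value are all assumptions for exposition.

```python
# Hypothetical sketch of dynamic layer-skipping at inference time.
# "informative" layers always run; "incremental" layers run only when
# the current action's importance score clears a threshold.

def forward_with_skipping(hidden, layers, importance, threshold=0.5):
    """Run a layer stack, bypassing incremental layers for low-importance actions.

    hidden:     the running activation (a toy scalar here)
    layers:     list of {"kind": ..., "fn": ...} dicts (illustrative structure)
    importance: per-action importance score in [0, 1] (assumed given)
    """
    for layer in layers:
        if layer["kind"] == "informative" or importance >= threshold:
            hidden = layer["fn"](hidden)
        # else: the incremental layer is skipped (identity pass-through)
    return hidden

# Toy stack: informative layers double the value, incremental layers add 1.
layers = [
    {"kind": "informative", "fn": lambda h: h * 2},
    {"kind": "incremental", "fn": lambda h: h + 1},
    {"kind": "informative", "fn": lambda h: h * 2},
    {"kind": "incremental", "fn": lambda h: h + 1},
]

print(forward_with_skipping(1, layers, importance=0.9))  # all layers run -> 7
print(forward_with_skipping(1, layers, importance=0.1))  # incremental skipped -> 4
```

In the paper's actual framework, the skip decision is learned via the prior-posterior guidance mechanism rather than a fixed threshold on a given score; the sketch only shows where such a gate sits in the forward pass.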