🤖 AI Summary
This work addresses the challenge of high computational overhead in Vision-Language-Action (VLA) models, which hinders their real-time deployment in robotic manipulation. To mitigate this, the authors propose a dynamic-static hierarchical layer-skipping mechanism that adaptively bypasses non-critical layers based on action importance. The approach is trained using a prior-posterior skipping guidance strategy and a skip-aware two-stage knowledge distillation framework. Experimental results demonstrate that the method substantially reduces computational cost while preserving task accuracy: on the CALVIN dataset, it achieves a 2.1% higher success length than Deer-VLA with 85.7× fewer trainable parameters, and delivers a 3.75× inference speedup over the RoboFlamingo baseline without compromising performance.
📝 Abstract
Vision-Language-Action (VLA) models have shown remarkable success in robotic tasks such as manipulation by fusing a language model's reasoning with a vision model's 3D understanding. However, their high computational cost remains a major obstacle for real-world applications that require real-time performance. We observe that the actions within a task have varying levels of importance: critical steps demand high precision, while less important ones can tolerate more variance. Leveraging this insight, we propose DySL-VLA, a novel framework that reduces computational cost by dynamically skipping VLA layers based on each action's importance. DySL-VLA categorizes its layers into two types: informative layers, which are always executed, and incremental layers, which can be selectively skipped. To skip layers intelligently without sacrificing accuracy, we introduce a prior-posterior skipping guidance mechanism that determines when to initiate layer-skipping. We also propose a skip-aware two-stage knowledge distillation algorithm to efficiently train a standard VLA into a DySL-VLA. Our experiments show that DySL-VLA achieves a 2.1% improvement in success length over Deer-VLA on the CALVIN dataset, while reducing trainable parameters by a factor of 85.7 and providing a 3.75× speedup relative to the RoboFlamingo baseline at iso-accuracy. Our code is available at https://github.com/PKU-SEC-Lab/DYSL_VLA.
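The core inference-time idea — always run informative layers, and bypass incremental layers when the current action is deemed unimportant — can be sketched in a few lines. This is a minimal illustrative sketch, not the paper's implementation: the layer structure, the scalar `importance` gate, and the `threshold` value are all assumptions for exposition.

```python
# Hypothetical sketch of dynamic layer-skipping at inference time.
# "informative" layers always run; "incremental" layers run only when
# the current action's importance score clears a threshold.

def forward_with_skipping(hidden, layers, importance, threshold=0.5):
    """Run a layer stack, bypassing incremental layers for low-importance actions.

    hidden:     the running activation (a toy scalar here)
    layers:     list of {"kind": ..., "fn": ...} dicts (illustrative structure)
    importance: per-action importance score in [0, 1] (assumed given)
    """
    for layer in layers:
        if layer["kind"] == "informative" or importance >= threshold:
            hidden = layer["fn"](hidden)
        # else: the incremental layer is skipped (identity pass-through)
    return hidden

# Toy stack: informative layers double the value, incremental layers add 1.
layers = [
    {"kind": "informative", "fn": lambda h: h * 2},
    {"kind": "incremental", "fn": lambda h: h + 1},
    {"kind": "informative", "fn": lambda h: h * 2},
    {"kind": "incremental", "fn": lambda h: h + 1},
]

print(forward_with_skipping(1, layers, importance=0.9))  # all layers run -> 7
print(forward_with_skipping(1, layers, importance=0.1))  # incremental skipped -> 4
```

In the paper's actual framework, the skip decision is learned via the prior-posterior guidance mechanism rather than a fixed threshold on a given score; the sketch only shows where such a gate sits in the forward pass.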