🤖 AI Summary
Existing static quantization methods struggle to accommodate the dynamic precision requirements of embodied vision-language-action (VLA) models across different temporal stages, leading to performance degradation and inefficient resource utilization. This work proposes DyQ-VLA, the first quantization framework for VLAs that accounts for temporal dynamics. By leveraging a kinematic proxy to trigger real-time bit-width switching and a kinematics-guided module to dynamically allocate the optimal bit width, DyQ-VLA enables sensitivity-aware, stage-adaptive quantization. The approach is notably efficient: it retains 99.5% of the original model's performance with only 30.9% of the memory footprint, delivers a 1.49× speedup in simulation, and accelerates real-world robotic tasks by up to 1.43×.
📝 Abstract
Vision-Language-Action (VLA) models are dominant in embodied intelligence but are constrained by high inference overhead. While model quantization alleviates these bottlenecks for edge deployment, static quantization approaches remain suboptimal for VLAs due to two critical challenges: (1) Temporal-dynamic sensitivity, where fixed precision wastes resources by ignoring stage-varying error tolerances; and (2) Real-time allocation, where identifying real-time sensitivity to guide bit allocation remains unsolved. To address these challenges, we propose DyQ-VLA, a dynamic quantization framework for VLAs. Specifically, a sensitivity-aware switching strategy leverages real-time kinematic proxies to trigger bit-width switching, while a kinematics-guided module dynamically allocates the optimal bit width. Experiments show that DyQ-VLA requires only 30.9% of the original memory footprint while maintaining 99.5% of the original model's performance, achieving a 1.49× speedup in simulation and up to 1.43× on real-world tasks.
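To make the stage-adaptive idea concrete, here is a minimal, hypothetical sketch of proxy-driven bit-width switching. It is not the paper's implementation: the proxy (end-effector speed), the threshold, and the bit widths (4-bit for coarse reaching, 8-bit for fine manipulation) are all illustrative assumptions.

```python
# Hypothetical sketch of sensitivity-aware bit-width switching.
# The kinematic proxy (end-effector speed), threshold, and bit widths
# are illustrative assumptions, not DyQ-VLA's actual choices.

def select_bitwidth(ee_speed: float, fast_threshold: float = 0.2) -> int:
    """Map a kinematic proxy (end-effector speed, m/s) to a bit width.

    Fast, coarse motions are assumed to tolerate more quantization
    error (lower precision); slow, fine manipulation is assumed to
    require higher precision.
    """
    return 4 if ee_speed > fast_threshold else 8

def run_episode(proxy_trace: list[float]) -> list[int]:
    """Re-evaluate the bit width at every control step from the proxy."""
    return [select_bitwidth(speed) for speed in proxy_trace]

# Coarse reach (fast) steps get 4-bit; fine grasp (slow) steps get 8-bit.
print(run_episode([0.35, 0.30, 0.15, 0.05]))  # → [4, 4, 8, 8]
```

In a real system, the selected bit width would index into pre-quantized weight copies or reconfigure quantized kernels, so that switching costs stay negligible relative to inference.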