Enabling Dynamic Tracking in Vision-Language-Action Models via Time-Discrete and Time-Continuous Velocity Feedforward

πŸ“… 2026-03-17
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
Existing vision-language-action (VLA) models struggle to simultaneously achieve high trajectory tracking accuracy and compliant contact safety when deployed on rigid industrial robots, primarily because they omit velocity feedforward information. This work introduces, for the first time, a model-agnostic velocity feedforward mechanism into the VLA framework. It is compatible with any action chunking architecture and requires modifications only to teleoperation data collection, preprocessing, and the low-level controller, unifying dynamic tracking performance with compliance. Two trajectory generation strategies are proposed: a finite-difference-based time-discrete method that improves execution speed, and a cubic B-spline-based time-continuous approach that improves higher-order smoothness. Evaluated on a highly contact-rich cube-in-hole task, both methods significantly improve task performance and safety.
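The time-discrete variant can be illustrated with a short sketch (our own, not the paper's code): given a chunk of future poses predicted by the policy, velocity feedforward targets are approximated by finite differences between consecutive chunk entries. The function name and array shapes are assumptions for illustration.

```python
import numpy as np

def velocity_feedforward(pose_chunk: np.ndarray, dt: float) -> np.ndarray:
    """Approximate per-step velocity targets from a predicted action chunk.

    pose_chunk: (H, D) array of H future poses (e.g. D = 3 position dims).
    dt: time step between consecutive poses in the chunk.
    Returns an (H, D) array of velocity targets; the last entry reuses the
    final difference so the low-level controller always has a feedforward term.
    """
    vel = np.diff(pose_chunk, axis=0) / dt   # forward differences, (H-1, D)
    vel = np.vstack([vel, vel[-1]])          # pad to (H, D)
    return vel

# Toy chunk: straight-line motion along x at 0.5 m/s, sampled at dt = 0.1 s
chunk = np.linspace([0.0, 0.0, 0.0], [0.15, 0.0, 0.0], 4)
print(velocity_feedforward(chunk, dt=0.1))  # x-velocity ≈ 0.5 everywhere
```

A real controller would apply these targets alongside the pose setpoints, so tracking accuracy no longer depends on high stiffness alone.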

πŸ“ Abstract
While vision-language-action (VLA) models have shown great promise for robot manipulation, their deployment on rigid industrial robots remains challenging due to the inherent trade-off between compliance and responsiveness. Standard Behavior Cloning (BC) approaches predict discrete poses at low frequencies, omitting the velocity and acceleration feedforward terms typically used by low-level compliant controllers. This forces the controller to rely on high stiffness for accurate tracking, thereby sacrificing safe contact dynamics. In this paper, we demonstrate the importance of integrating velocity feedforward terms into VLA policies to resolve this trade-off. We propose two methods for extracting velocity targets from VLAs: a time-discrete finite-difference approximation that serves as a highly effective bridge for existing models, and a continuous Cubic B-Spline action space that natively yields $C^2$ continuous trajectories for high-frequency control. Crucially, both approaches are strictly model-agnostic and compatible with any standard action-chunking architecture, requiring modifications only to teleoperation, data processing, and the low-level controller. We fine-tune the $Ο€_{0.5}$ model and evaluate both of our approaches on a demanding, contact-rich cube-in-hole task. Our results indicate that incorporating the velocity feedforward term via finite differences significantly improves task execution speed, while the continuous B-Spline approach maintains high overall success rates and provides a foundation for smoother higher-order derivatives without compromising compliance.
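The B-spline action space mentioned in the abstract can be sketched as follows: for a uniform cubic B-spline, each segment is a fixed polynomial blend of four control points, and differentiating the basis analytically yields a smooth velocity feedforward signal. This is a minimal single-segment illustration under standard uniform-knot assumptions, not the paper's implementation.

```python
import numpy as np

def cubic_bspline_eval(ctrl: np.ndarray, u: float):
    """Evaluate one uniform cubic B-spline segment and its derivative.

    ctrl: (4, D) control points (e.g. predicted by the policy head).
    u: local parameter in [0, 1] within the segment.
    Returns (position, velocity w.r.t. u); dividing the velocity by the
    segment duration gives a physical feedforward term.
    """
    # Standard uniform cubic B-spline basis and its analytic derivative.
    b = np.array([(1 - u) ** 3,
                  3 * u**3 - 6 * u**2 + 4,
                  -3 * u**3 + 3 * u**2 + 3 * u + 1,
                  u**3]) / 6.0
    db = np.array([-((1 - u) ** 2),
                   3 * u**2 - 4 * u,
                   -3 * u**2 + 2 * u + 1,
                   u**2]) / 2.0
    return b @ ctrl, db @ ctrl

# Collinear, evenly spaced control points reproduce constant velocity:
ctrl = np.array([[0.0], [1.0], [2.0], [3.0]])
pos, vel = cubic_bspline_eval(ctrl, 0.5)
print(pos, vel)  # midpoint 1.5, derivative 1.0
```

Because adjacent segments share three control points, position, velocity, and acceleration are continuous across segment boundaries, which is the $C^2$ property the abstract refers to.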
Problem

Research questions and friction points this paper is trying to address.

vision-language-action models
velocity feedforward
compliance
robot manipulation
trajectory tracking
Innovation

Methods, ideas, or system contributions that make the work stand out.

velocity feedforward
vision-language-action models
compliant control
Cubic B-Spline
action chunking
Johannes Hechtl
Siemens Foundational Technologies
Philipp Schmitt
Siemens Foundational Technologies
Georg von Wichert
Siemens Foundational Technologies
Wolfram Burgard
Professor of Computer Science, University of Technology Nuremberg
Robotics · Artificial Intelligence · AI · Machine Learning · Computer Vision