🤖 AI Summary
This work addresses the limited precision and controllability of existing vision–language–action (VLA) models when interpreting and executing language instructions that densely encode kinematic attributes such as direction, trajectory, orientation, and relative displacement. To this end, we propose KineVLA, a novel framework that formally defines the kinematics-rich VLA task and introduces a bi-level action representation coupled with bi-level reasoning tokens. This design explicitly decouples task-goal invariance from kinematic variability, enabling precise responses to kinematics-sensitive instructions. We further construct kinematics-aware VLA datasets spanning both simulated and real robotic environments, complete with a dedicated annotation protocol. Through joint vision–language–action modeling and supervision of the reasoning tokens as intermediate alignment variables, KineVLA significantly outperforms prior methods on the LIBERO benchmark and a Realman-75 robot, demonstrating superior accuracy, controllability, and generalization on kinematically dense tasks.
📝 Abstract
In this paper, we introduce a novel kinematics-rich vision-language-action (VLA) task, in which language commands densely encode diverse kinematic attributes (such as direction, trajectory, orientation, and relative displacement) at key moments from initiation through completion. Unlike existing action instructions, which capture kinematics only coarsely or partially, such commands support fine-grained and personalized manipulation. In this setting, task goals remain invariant while execution trajectories must adapt to instruction-level kinematic specifications. To address this challenge, we propose KineVLA, a vision-language-action framework that explicitly decouples goal-level invariance from kinematics-level variability through a bi-level action representation and bi-level reasoning tokens, which serve as explicit, supervised intermediate variables aligning language and action. To support this task, we construct kinematics-aware VLA datasets spanning both simulation and real-world robotic platforms, featuring instruction-level kinematic variations and bi-level annotations. Extensive experiments on LIBERO and a Realman-75 robot demonstrate that KineVLA consistently outperforms strong VLA baselines on kinematics-sensitive benchmarks, achieving more precise, controllable, and generalizable manipulation behaviors.
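
To make the bi-level design concrete, the following is a minimal, hypothetical sketch of how reasoning tokens at two levels could be supervised as intermediate variables and used to condition action prediction. All names (`BiLevelReasoningHead`, the vocabulary sizes, token counts, and feature dimensions) are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BiLevelReasoningHead(nn.Module):
    """Hypothetical sketch (not the authors' code): predict goal-level
    tokens (the invariant task goal) and kinematics-level tokens
    (direction, trajectory, orientation, ...) from a fused
    vision-language feature, supervise both as intermediate variables,
    and condition the action head on the decoded tokens."""

    def __init__(self, feat_dim=512, goal_vocab=32, kin_vocab=64,
                 n_goal=4, n_kin=8, action_dim=7):
        super().__init__()
        self.n_goal, self.n_kin = n_goal, n_kin
        self.goal_head = nn.Linear(feat_dim, n_goal * goal_vocab)  # goal-level logits
        self.kin_head = nn.Linear(feat_dim, n_kin * kin_vocab)     # kinematics-level logits
        self.goal_emb = nn.Embedding(goal_vocab, feat_dim)
        self.kin_emb = nn.Embedding(kin_vocab, feat_dim)
        self.action_head = nn.Sequential(
            nn.Linear(3 * feat_dim, feat_dim), nn.ReLU(),
            nn.Linear(feat_dim, action_dim))

    def forward(self, feat, goal_labels=None, kin_labels=None):
        B = feat.size(0)
        goal_logits = self.goal_head(feat).view(B, self.n_goal, -1)
        kin_logits = self.kin_head(feat).view(B, self.n_kin, -1)

        # Intermediate-variable supervision: cross-entropy on both token levels.
        aux_loss = feat.new_zeros(())
        if goal_labels is not None:
            aux_loss = aux_loss + F.cross_entropy(
                goal_logits.flatten(0, 1), goal_labels.flatten())
        if kin_labels is not None:
            aux_loss = aux_loss + F.cross_entropy(
                kin_logits.flatten(0, 1), kin_labels.flatten())

        # Condition action prediction on both token streams (hard argmax
        # decode for simplicity; a soft or Gumbel decode would keep
        # gradients flowing through the token choice).
        goal_ctx = self.goal_emb(goal_logits.argmax(-1)).mean(dim=1)
        kin_ctx = self.kin_emb(kin_logits.argmax(-1)).mean(dim=1)
        action = self.action_head(torch.cat([feat, goal_ctx, kin_ctx], dim=-1))
        return action, aux_loss

# Toy usage: a batch of 2 fused features with bi-level token labels.
model = BiLevelReasoningHead()
feat = torch.randn(2, 512)
goal_labels = torch.randint(0, 32, (2, 4))
kin_labels = torch.randint(0, 64, (2, 8))
action, aux_loss = model(feat, goal_labels, kin_labels)
```

The separation into two supervised token streams is what lets goal identity stay fixed while kinematic tokens vary with the instruction; the argmax decode above is a simplification for readability.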