Grounding Driving VLA via Inverse Kinematics

📅 2026-05-20

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses a limitation of existing driving vision-language-action (VLA) models: when predicting trajectories, they largely ignore their visual tokens and shortcut through ego-vehicle status and text commands. The authors trace this to an ill-posed task formulation rather than to insufficient training. Viewed as an inverse kinematics problem, trajectory recovery needs both a current and a future visual state as boundary conditions, and existing VLAs supply only the current one. The proposed design has two parts: a next-visual-state prediction objective that makes the LLM forecast the future visual scene, and a separate Inverse Kinematics Network, a cross-attention-based conditional diffusion model that decodes the trajectory from the current and predicted future visual states alone. The 0.5B-parameter model reaches planning performance comparable to 7B–8B VLAs on the closed-loop NAVSIM-v2 benchmark and on nuScenes. The analysis attributes the gain to a recovered use of visual features, most pronounced in dynamic scenarios such as turning.

📝 Abstract

Existing Driving VLAs predict trajectories while largely ignoring their visual tokens, a phenomenon we trace not to insufficient training but to a structurally ill-posed task formulation. We show that trajectory recovery, when viewed through the lens of inverse kinematics, requires both a current and a future visual state as boundary conditions; existing VLAs supply only the former, which encourages the model to shortcut through ego status and text commands alone. To address this, we re-design Driving VLA in the style of an inverse kinematics solver. First, a next visual state prediction objective that requires the LLM to predict the future visual scene provides dense visual supervision and suppresses shortcut paths. Second, a separate Inverse Kinematics Network (a cross-attention-based conditional diffusion model) that takes only the current and future visual states as input is designed to suppress reliance on ego status and textual shortcuts during trajectory decoding. With this simple prescription alone, our 0.5B-scale model recovers visual grounding and reaches trajectory planning performance comparable to 7B–8B VLAs more than an order of magnitude larger, on both the closed-loop NAVSIM-v2 and the nuScenes benchmarks. Extensive analysis further shows that this improvement stems from a recovered ability to exploit visual features, with the effect being most pronounced in dynamic driving situations such as turning.
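To make the decoding idea concrete, here is a minimal NumPy sketch of a cross-attention-conditioned denoiser that sees only current and future visual tokens. It is an illustration under assumptions, not the authors' implementation: the class name `ToyIKDenoiser`, the function `sample_trajectory`, the token and waypoint shapes, the random untrained weights, and the linear noise schedule are all hypothetical, and the sampling loop is a generic DDPM-style update that the abstract does not specify.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, context, w_q, w_k, w_v):
    """Waypoint queries (T, d) attend to visual context tokens (N, d)."""
    q, k, v = queries @ w_q, context @ w_k, context @ w_v
    weights = softmax(q @ k.T / np.sqrt(q.shape[-1]))  # (T, N), rows sum to 1
    return weights @ v, weights

class ToyIKDenoiser:
    """Hypothetical stand-in for an inverse-kinematics-style decoder (untrained)."""

    def __init__(self, dim=16, seed=0):
        rng = np.random.default_rng(seed)
        s = 1.0 / np.sqrt(dim)
        self.w_in = rng.normal(0.0, 1.0, (2, dim))   # lifts (x, y) waypoints to dim
        self.w_t = rng.normal(0.0, 1.0, (dim,))      # crude diffusion-step embedding
        self.w_q, self.w_k, self.w_v = (rng.normal(0.0, s, (dim, dim)) for _ in range(3))
        self.w_out = rng.normal(0.0, s, (dim, 2))    # projects back to (x, y) noise

    def predict_noise(self, noisy_traj, step_frac, cur_tokens, fut_tokens):
        # The conditioning is ONLY the two visual states: no ego status, no text.
        context = np.concatenate([cur_tokens, fut_tokens], axis=0)
        h = noisy_traj @ self.w_in + step_frac * self.w_t
        attended, weights = cross_attention(h, context, self.w_q, self.w_k, self.w_v)
        return (h + attended) @ self.w_out, weights

def sample_trajectory(model, cur_tokens, fut_tokens, horizon=8, steps=10, seed=0):
    """Generic DDPM-style reverse process (assumed, not taken from the paper)."""
    rng = np.random.default_rng(seed)
    betas = np.linspace(1e-4, 0.02, steps)
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    traj = rng.normal(size=(horizon, 2))  # start from pure noise
    for t in reversed(range(steps)):
        eps, _ = model.predict_noise(traj, t / steps, cur_tokens, fut_tokens)
        traj = (traj - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps) / np.sqrt(alphas[t])
        if t > 0:
            traj = traj + np.sqrt(betas[t]) * rng.normal(size=traj.shape)
    return traj

rng = np.random.default_rng(1)
cur = rng.normal(size=(4, 16))  # placeholder current visual tokens
fut = rng.normal(size=(4, 16))  # placeholder predicted future visual tokens
print(sample_trajectory(ToyIKDenoiser(), cur, fut).shape)  # (8, 2)
```

Because the denoiser's only inputs besides the noisy waypoints are the two sets of visual tokens, any structure in the output has to come from the visual states. That is the shortcut-suppression argument the abstract makes, shown here with random weights rather than a trained model.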

Problem

Research questions and friction points this paper is trying to address.

Driving VLA

inverse kinematics

visual grounding

trajectory prediction

visual tokens

Innovation

Methods, ideas, or system contributions that make the work stand out.

Inverse Kinematics

Visual Grounding

Driving VLA