🤖 AI Summary
Existing vision-language-action (VLA) models for autonomous driving suffer from imprecise numerical reasoning, weak 3D spatial perception, and high sensitivity to context, which limits their suitability for safety-critical applications. To address these limitations, this work proposes HiST-VLA, a hierarchical spatiotemporal VLA model that strengthens 3D spatiotemporal reasoning through geometry-aware fusion, fine-grained instruction conditioning, and state-history prompting. A dynamic token sparsification mechanism fuses redundant tokens rather than filtering them, maintaining computational efficiency without sacrificing performance. A hierarchical Transformer-based trajectory planner refines the coarse VLA waypoints into fine-grained trajectories and uses dynamic latent regularization to incorporate language instructions, ensuring strict spatial grounding and temporal consistency in action generation. On the NAVSIM v2 benchmark, the method achieves state-of-the-art results, with an EPDMS of 88.6 on Navtest and 50.9 on the pseudo-closed-loop Navhard benchmark.
📝 Abstract
Vision-Language-Action (VLA) models offer promising capabilities for autonomous driving through multimodal understanding. However, their use in safety-critical scenarios is constrained by inherent limitations, including imprecise numerical reasoning, weak 3D spatial awareness, and high sensitivity to context. To address these challenges, we propose HiST-VLA, a novel Hierarchical Spatio-Temporal VLA model designed for reliable trajectory generation. Our framework enhances 3D spatial and temporal reasoning by integrating geometric awareness with fine-grained driving commands and state-history prompting. To ensure computational efficiency, we integrate dynamic token sparsification into the VLA architecture. This approach fuses redundant tokens rather than filtering them, reducing redundancy without sacrificing model performance. Furthermore, we employ a hierarchical transformer-based planner to progressively refine coarse VLA waypoints into fine-grained trajectories. Crucially, the planner uses dynamic latent regularization to incorporate language commands, ensuring strict spatial grounding and temporal coherence. Extensive evaluation on the NAVSIM v2 benchmark demonstrates state-of-the-art performance, with an EPDMS of 88.6 on Navtest and 50.9 on the pseudo-closed-loop Navhard benchmark.
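The abstract distinguishes fusing redundant tokens from filtering them but does not give the fusion rule. The following is a minimal sketch of the general idea only, not the paper's mechanism. It assumes that redundancy is measured by cosine similarity between adjacent tokens and that a fused token is the size-weighted mean of its members. The names `cosine`, `fuse_tokens`, and `keep` are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def fuse_tokens(tokens, keep):
    """Reduce a token sequence to at most `keep` tokens by merging, not dropping.

    Assumption (not from the paper): the most similar adjacent pair is merged
    first, and the merged token is the mean of all original tokens it covers.
    """
    if keep < 1:
        raise ValueError("keep must be at least 1")
    # Each entry is (vector, number of original tokens it represents).
    merged = [(list(t), 1) for t in tokens]
    while len(merged) > keep:
        # Index of the adjacent pair with the highest similarity.
        i = max(
            range(len(merged) - 1),
            key=lambda j: cosine(merged[j][0], merged[j + 1][0]),
        )
        (a, wa), (b, wb) = merged[i], merged[i + 1]
        fused = [(wa * x + wb * y) / (wa + wb) for x, y in zip(a, b)]
        merged[i:i + 2] = [(fused, wa + wb)]
    return [vec for vec, _ in merged]


if __name__ == "__main__":
    tokens = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
    # The two identical tokens are fused; the distinct one is left untouched.
    print(fuse_tokens(tokens, keep=2))
```

A filtering approach would discard one of the two identical tokens outright. Here every original token still contributes to the output through the weighted mean, which is the sense in which fusion preserves information while shortening the sequence.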