HiST-VLA: A Hierarchical Spatio-Temporal Vision-Language-Action Model for End-to-End Autonomous Driving

📅 2026-02-11

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing vision-language-action (VLA) models for autonomous driving struggle with imprecise numerical reasoning, weak 3D spatial perception, and high sensitivity to context, which limits their suitability for safety-critical applications. To address these limitations, this work proposes HiST-VLA, a hierarchical spatio-temporal VLA model that strengthens 3D spatial and temporal reasoning by combining geometric awareness with fine-grained driving commands and state-history prompting. A dynamic token sparsification mechanism, which fuses redundant tokens rather than filtering them out, maintains computational efficiency without sacrificing performance. A hierarchical Transformer-based planner progressively refines the coarse VLA waypoints into fine-grained trajectories and uses dynamic latent regularization to incorporate language commands, enforcing strict spatial grounding and temporal coherence. Evaluated on the NAVSIM v2 benchmark, the proposed method achieves state-of-the-art results on Navtest with an EPDMS of 88.6, and an EPDMS of 50.9 on the pseudo-closed-loop Navhard benchmark.

📝 Abstract

Vision-Language-Action (VLA) models offer promising capabilities for autonomous driving through multimodal understanding. However, their use in safety-critical scenarios is constrained by inherent limitations, including imprecise numerical reasoning, weak 3D spatial awareness, and high sensitivity to context. To address these challenges, we propose HiST-VLA, a novel Hierarchical Spatio-Temporal VLA model designed for reliable trajectory generation. Our framework enhances 3D spatial and temporal reasoning by integrating geometric awareness with fine-grained driving commands and state-history prompting. To ensure computational efficiency, we integrate dynamic token sparsification into the VLA architecture. This approach fuses redundant tokens rather than filtering them, reducing redundancy without sacrificing model performance. Furthermore, we employ a hierarchical transformer-based planner to progressively refine coarse VLA waypoints into fine-grained trajectories. Crucially, the planner uses dynamic latent regularization to incorporate language commands, ensuring strict spatial grounding and temporal coherence. Extensive evaluation on the NAVSIM v2 benchmark demonstrates state-of-the-art performance on Navtest, with an EPDMS of 88.6, and an EPDMS of 50.9 on the pseudo-closed-loop Navhard benchmark.
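
The abstract does not spell out how redundant tokens are fused, so the following is only an illustrative sketch of the general idea of merging tokens instead of dropping them. The function name `fuse_redundant_tokens`, the L2-norm importance score, and the cosine-similarity assignment are all assumptions made for this example and are not taken from the paper.

```python
import numpy as np

def fuse_redundant_tokens(tokens, keep):
    """Reduce an (N, D) token array to (keep, D) by merging, not dropping.

    Hypothetical scheme (not the paper's): the `keep` tokens with the
    largest L2 norm act as anchors, and every other token is averaged
    into the anchor it is most similar to (cosine similarity).
    """
    tokens = np.asarray(tokens, dtype=float)
    n = tokens.shape[0]
    if keep >= n:
        return tokens.copy()

    # Assumed importance score: the L2 norm of each token.
    scores = np.linalg.norm(tokens, axis=1)
    order = np.argsort(-scores, kind="stable")
    anchor_idx = np.sort(order[:keep])   # preserve original token order
    rest_idx = np.sort(order[keep:])

    # Cosine similarity between each leftover token and each anchor.
    unit = tokens / np.maximum(scores[:, None], 1e-8)
    sim = unit[rest_idx] @ unit[anchor_idx].T   # shape (N - keep, keep)
    assign = sim.argmax(axis=1)

    # Average each anchor with the tokens assigned to it.
    sums = tokens[anchor_idx].copy()
    counts = np.ones(keep)
    np.add.at(sums, assign, tokens[rest_idx])
    np.add.at(counts, assign, 1.0)
    return sums / counts[:, None]

tokens = np.array([[2.0, 0.0], [0.0, 2.0], [1.0, 0.0], [0.0, 1.0]])
fused = fuse_redundant_tokens(tokens, keep=2)
print(fused)  # two tokens remain; each is the mean of its merged group
```

Because the leftover tokens are averaged into an anchor rather than discarded, their information still contributes to the shortened sequence, which is the property the abstract contrasts with plain filtering.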

Problem

Research questions and friction points this paper is trying to address.

Vision-Language-Action

autonomous driving

3D spatial awareness

numerical reasoning

context sensitivity

Innovation

Methods, ideas, or system contributions that make the work stand out.

Hierarchical Spatio-Temporal Modeling

Dynamic Token Sparsification

Vision-Language-Action (VLA)