HiST-VLA: A Hierarchical Spatio-Temporal Vision-Language-Action Model for End-to-End Autonomous Driving

📅 2026-02-11
🤖 AI Summary
Existing vision-language-action (VLA) models for autonomous driving struggle with imprecise numerical reasoning, weak 3D spatial perception, and insufficient context sensitivity, hindering their suitability for safety-critical applications. To address these limitations, this work proposes a hierarchical spatiotemporal VLA model that enhances 3D spatiotemporal reasoning through geometry-aware fusion, fine-grained instruction conditioning, and state-history prompting. A dynamic token sparsification mechanism is introduced to maintain computational efficiency without sacrificing performance. The architecture features a hierarchical Transformer-based trajectory planner integrated with dynamic latent regularization, ensuring strict spatial anchoring of language instructions and temporal consistency in action generation. Evaluated on the NAVSIM v2 benchmark, the proposed method achieves state-of-the-art results, reporting 88.6 Navtest EPDMS and 50.9 pseudo-closed-loop Navhard EPDMS scores.

📝 Abstract
Vision-Language-Action (VLA) models offer promising capabilities for autonomous driving through multimodal understanding. However, their use in safety-critical scenarios is constrained by inherent limitations, including imprecise numerical reasoning, weak 3D spatial awareness, and high sensitivity to context. To address these challenges, we propose HiST-VLA, a novel Hierarchical Spatio-Temporal VLA model designed for reliable trajectory generation. Our framework enhances 3D spatial and temporal reasoning by integrating geometric awareness with fine-grained driving commands and state-history prompting. To ensure computational efficiency, we integrate dynamic token sparsification into the VLA architecture. This approach fuses redundant tokens rather than filtering them out, reducing redundancy without sacrificing model performance. Furthermore, we employ a hierarchical transformer-based planner to progressively refine coarse VLA waypoints into fine-grained trajectories. Crucially, the planner uses dynamic latent regularization to incorporate language commands, ensuring strict spatial grounding and temporal coherence. Extensive evaluation on the NAVSIM v2 benchmark demonstrates state-of-the-art performance, achieving an EPDMS of 88.6 on Navtest and an EPDMS of 50.9 on the pseudo-closed-loop Navhard benchmark.
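The abstract distinguishes fusing redundant tokens from filtering them. The sketch below illustrates that general idea only and is not the paper's mechanism: the function name `fuse_tokens`, the cosine-similarity criterion, the greedy grouping, and the `threshold` value are all assumptions made for illustration. Each token is either merged into the running mean of a sufficiently similar group or starts a new group, so no token's information is discarded outright.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def fuse_tokens(tokens, threshold=0.9):
    """Greedily merge similar tokens into group means (hypothetical helper).

    Returns (fused, counts): one mean vector per group and the number of
    original tokens folded into each group.
    """
    fused = []   # running mean vector of each group
    counts = []  # how many tokens each group has absorbed
    for tok in tokens:
        best_idx, best_sim = -1, threshold
        for i, rep in enumerate(fused):
            sim = cosine(tok, rep)
            if sim >= best_sim:
                best_idx, best_sim = i, sim
        if best_idx < 0:
            # No group is similar enough: keep the token as a new group.
            fused.append([float(x) for x in tok])
            counts.append(1)
        else:
            # Fuse: update the group's running mean instead of dropping the token.
            n = counts[best_idx]
            fused[best_idx] = [
                (r * n + t) / (n + 1) for r, t in zip(fused[best_idx], tok)
            ]
            counts[best_idx] = n + 1
    return fused, counts


if __name__ == "__main__":
    tokens = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
    fused, counts = fuse_tokens(tokens)
    print(len(tokens), "tokens fused into", len(fused), "groups; sizes:", counts)
```

A filtering scheme would instead delete the second token; here it contributes to its group's mean, and `counts` records how many tokens each fused token stands for.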
Problem

Research questions and friction points this paper is trying to address.

Vision-Language-Action
autonomous driving
3D spatial awareness
numerical reasoning
context sensitivity
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hierarchical Spatio-Temporal Modeling
Dynamic Token Sparsification
Vision-Language-Action (VLA)
3D Spatial Awareness
Transformer-based Planner
Yiru Wang
University of Pittsburgh
Econometrics
Zichong Gu
School of Communication and Information Engineering, Shanghai University, Shanghai, China
Yu Gao
Unknown affiliation
Algorithms, Data structures
Anqing Jiang
Bosch Corporate Research, Bosch (China) Investment Ltd., Shanghai, China
Zhigang Sun
Bosch Corporate Research, Bosch (China) Investment Ltd., Shanghai, China
Shuo Wang
Bosch Corporate Research, Bosch (China) Investment Ltd., Shanghai, China
Yuwen Heng
Bosch Corporate Research, Bosch (China) Investment Ltd., Shanghai, China
Hao Sun
Bosch Corporate Research, Bosch (China) Investment Ltd., Shanghai, China