🤖 AI Summary
Current VLM-driven end-to-end autonomous driving methods struggle to understand fine-grained 3D spatial relationships, which limits reliable interaction with the physical world. To address this, we propose an explicit 3D spatial modeling framework that encodes continuous 3D coordinates into differentiable positional encodings (PEs), jointly fusing multi-view depth, ego-vehicle historical states, and textual prompts. Notably, we represent spatial coordinates with continuous PEs rather than discrete digit tokens, enabling task-agnostic spatial indexing and direct end-to-end regression of trajectory coordinates. Our method comprises: multi-view depth estimation; a universal positional encoder; joint vision-language fine-tuning of the VLM; 3D PE augmentation of 2D visual tokens; and a coordinate-regression-based planning head. In open-loop evaluation on nuScenes, it achieves state-of-the-art performance; in closed-loop Bench2Drive testing, it reaches a Driving Score of 78.02, the second-best among existing VLM-based methods.
📝 Abstract
End-to-end autonomous driving methods built on vision-language models (VLMs) have developed rapidly, driven by the universal visual understanding and strong reasoning capabilities that VLMs obtain from large-scale pretraining. However, we find that current VLMs struggle to understand fine-grained 3D spatial relationships, which is a fundamental requirement for systems that interact with the physical world. To address this issue, we propose SpaceDrive, a spatial-aware VLM-based driving framework that treats spatial information as explicit positional encodings (PEs) instead of textual digit tokens, enabling joint reasoning over semantic and spatial representations. SpaceDrive applies a universal positional encoder to all 3D coordinates derived from multi-view depth estimation, historical ego-states, and text prompts. These 3D PEs are first superimposed on the corresponding 2D visual tokens to augment them. They also serve as a task-agnostic coordinate representation, replacing digit-wise numerical tokens as both inputs and outputs of the VLM. This mechanism enables the model to better index specific visual semantics during spatial reasoning and to regress trajectory coordinates directly rather than generating them digit by digit, thereby enhancing planning accuracy. Extensive experiments show that SpaceDrive achieves state-of-the-art open-loop performance on the nuScenes dataset and, among existing VLM-based methods, the second-best Driving Score of 78.02 on the Bench2Drive closed-loop benchmark.
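To make the mechanism described above more concrete, the following is a minimal NumPy sketch of the three steps: encoding continuous 3D coordinates, superimposing the encodings on visual tokens, and regressing waypoints directly. It is an illustration under stated assumptions, not the SpaceDrive implementation. The abstract does not specify the form of the positional encoder, so a sinusoidal (Fourier-feature) encoding is assumed here; the function names `encode_xyz`, `augment_tokens`, and `regress_waypoints`, the dimensions, the random weights, and the 2D waypoint output are all hypothetical.

```python
import numpy as np

def encode_xyz(xyz, dim=48, max_range=100.0):
    """Map continuous 3D coordinates to a fixed-size encoding.

    ASSUMPTION: a sinusoidal encoding is used for illustration only;
    the source does not state what the universal positional encoder is.
    """
    if dim % 6 != 0:
        raise ValueError("dim must be divisible by 6 (sin and cos for each of 3 axes)")
    xyz = np.asarray(xyz, dtype=np.float64)
    n_freq = dim // 6
    freqs = (2.0 ** np.arange(n_freq)) * np.pi / max_range   # (n_freq,)
    angles = xyz[..., None] * freqs                          # (..., 3, n_freq)
    pe = np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)
    return pe.reshape(*xyz.shape[:-1], dim)                  # (..., dim)

def augment_tokens(tokens, xyz, w_pe):
    """Superimpose projected 3D PEs on 2D visual tokens (element-wise add)."""
    return tokens + encode_xyz(xyz, dim=w_pe.shape[0]) @ w_pe

def regress_waypoints(hidden, w_out, b_out):
    """Linear head that outputs continuous coordinates in one step,
    instead of emitting them as digit tokens (hypothetical head)."""
    return hidden @ w_out + b_out

# Toy usage with random stand-ins for real features and learned weights.
rng = np.random.default_rng(0)
tokens = rng.normal(size=(4, 32))            # 4 visual tokens, hidden size 32
xyz = rng.uniform(-50.0, 50.0, size=(4, 3))  # one 3D point per token, e.g. from depth
w_pe = rng.normal(size=(48, 32)) * 0.02      # projects the PE to the token width
fused = augment_tokens(tokens, xyz, w_pe)

w_out = rng.normal(size=(32, 2)) * 0.02      # 2D waypoints are an assumption
waypoints = regress_waypoints(fused[:3], w_out, np.zeros(2))
print(fused.shape, waypoints.shape)
```

Because the same encoder can be applied to any coordinate, whether it comes from depth, ego history, or a prompt, a single representation can serve as both input and output, which is the task-agnostic property the abstract describes.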