🤖 AI Summary
Existing vision-language models for autonomous driving struggle to combine robust semantic reasoning with precise 3D spatial prediction, and often suffer from spatial hallucinations or representation interference. To address this, the work proposes TPS-Drive, a task-guided representation purification framework. Its agent-centric tokenizer uses a task-guided vector quantization mechanism, supervised by a frozen 3D detection head, to concentrate limited codebook capacity on critical dynamic agents rather than static background, which isolates spatial redundancy. A decoupled reasoning pipeline then performs scene understanding, future forecasting, and action generation in sequence, and the model is trained with a progressive three-stage strategy that ends in reward-driven refinement. The method reduces collision rates in open-loop evaluation on nuScenes and sets new safety records on the closed-loop NAVSIMv1 and NAVSIMv2 benchmarks.
📝 Abstract
Vision-Language Models (VLMs) provide a promising foundation for autonomous driving planning, yet bridging semantic reasoning and precise 3D spatial forecasting remains a critical challenge. Existing representation strategies generally follow two paths: text-aligned methods flatten continuous spatial states into symbols, which compromises geometric structure and induces "spatial hallucinations"; dense visual methods preserve spatial topology but overwhelm standard tokenizers with redundant background textures, leading to "representation interference". To address these limitations, we introduce TPS-Drive, a novel framework centered on Task-Guided Representation Purification that empowers VLMs to Think in Purified Space. At its core, an Agent-Centric Tokenizer utilizes a task-guided vector quantization mechanism supervised by a frozen 3D detection head, which explicitly reallocates limited codebook capacity from pervasive static backgrounds to critical dynamic agents and effectively isolates spatial redundancy. Leveraging this purified spatial vocabulary, TPS-Drive employs a decoupled reasoning pipeline that sequentially performs scene understanding, future forecasting, and action generation. The framework is optimized via a progressive three-stage training paradigm, culminating in reward-driven refinement that surpasses pure imitation learning. Extensive experiments validate our approach: TPS-Drive achieves accurate agent spatial state forecasting and reduces collision rates in open-loop nuScenes evaluations, while establishing new safety records on the rigorous closed-loop NAVSIMv1 and NAVSIMv2 benchmarks.
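To make the idea of reallocating codebook capacity concrete, the following is a minimal sketch of a saliency-weighted vector quantization loss. It is not the TPS-Drive implementation: the abstract does not give the loss, so the weighting scheme is an assumption, and the names `quantize`, `task_weighted_vq_loss`, and `saliency` are hypothetical. In the paper the task signal comes from a frozen 3D detection head; here it is simply passed in as one non-negative weight per feature.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def squared_distance(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(features: List[Vector], codebook: List[Vector]) -> List[int]:
    """Standard VQ lookup: index of the nearest codebook entry per feature."""
    return [
        min(range(len(codebook)), key=lambda k: squared_distance(f, codebook[k]))
        for f in features
    ]


def task_weighted_vq_loss(
    features: List[Vector],
    codebook: List[Vector],
    saliency: List[float],
) -> Tuple[List[int], float]:
    """Quantization error averaged with per-feature task weights.

    ASSUMPTION: `saliency` stands in for a task signal (for example, a score
    derived from a detection head). Features with high weight dominate the
    loss, so minimizing it pulls codebook entries toward salient regions and
    away from low-weight background.
    """
    total_weight = sum(saliency)
    if total_weight <= 0:
        raise ValueError("saliency weights must sum to a positive value")
    indices = quantize(features, codebook)
    weighted_error = sum(
        w * squared_distance(f, codebook[i])
        for f, i, w in zip(features, indices, saliency)
    )
    return indices, weighted_error / total_weight


if __name__ == "__main__":
    # Two background-like features and one agent-like feature (toy values).
    features = [(0.0, 0.0), (1.0, 1.0), (4.0, 4.0)]
    codebook = [(0.0, 0.0), (4.0, 5.0)]
    saliency = [0.1, 0.1, 1.0]  # hypothetical: the agent feature weighs most
    print(task_weighted_vq_loss(features, codebook, saliency))
```

With uniform weights this reduces to the ordinary mean quantization error. The only change is the per-feature weight, which is one plausible way for a task signal to steer where codebook entries are spent.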