🤖 AI Summary
Existing general-purpose vision-language models lack the structured spatiotemporal representation capabilities required for autonomous driving, struggling to model geometric relationships, scene context, and motion patterns. This work proposes a hierarchical world cognition architecture that establishes a three-level cognitive hierarchy—scene, agent, and goal—and, for the first time, explicitly integrates human-like driving decision-making mechanisms into the representation learning of vision-language models. Building upon a pretrained vision-language foundation, the method jointly incorporates scene understanding, modeling of key agent behaviors, and short-term goal generation to enable driving-specific reasoning. Evaluated on the NAVSIM benchmark, the proposed approach achieves state-of-the-art performance among purely visual methods on both PDMS and EPDMS metrics.
📝 Abstract
Recent end-to-end autonomous driving approaches have leveraged Vision-Language Models (VLMs) to enhance planning capabilities in complex driving scenarios. However, VLMs are inherently trained as generalist models, lacking specialized understanding of driving-specific reasoning in 3D space and time. When applied to autonomous driving, these models struggle to establish structured spatiotemporal representations that capture geometric relationships, scene context, and motion patterns critical for safe trajectory planning. To address these limitations, we propose SGDrive, a novel framework that explicitly structures the VLM's representation learning around driving-specific knowledge hierarchies. Built upon a pre-trained VLM backbone, SGDrive decomposes driving understanding into a scene-agent-goal hierarchy that mirrors human driving cognition: drivers first perceive the overall environment (scene context), then attend to safety-critical agents and their behaviors, and finally formulate short-term goals before executing actions. This hierarchical decomposition provides the structured spatiotemporal representation that generalist VLMs lack, integrating multi-level information into a compact yet comprehensive format for trajectory planning. Extensive experiments on the NAVSIM benchmark demonstrate that SGDrive achieves state-of-the-art performance among camera-only methods on both PDMS and EPDMS, validating the effectiveness of hierarchical knowledge structuring for adapting generalist VLMs to autonomous driving.
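The scene-agent-goal decomposition described in the abstract can be illustrated with a minimal sketch. Everything below is a hypothetical reading of the three-level hierarchy, not the paper's actual interface: the class names (`SceneContext`, `AgentBehavior`, `ShortTermGoal`) and the idea of folding the levels into a compact text representation for a VLM-based planner are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical sketch of a scene-agent-goal hierarchy; all names and
# fields are illustrative assumptions, not SGDrive's actual API.

@dataclass
class SceneContext:
    """Level 1: overall environment, e.g. road layout and traffic state."""
    description: str

@dataclass
class AgentBehavior:
    """Level 2: a safety-critical agent and its observed/predicted behavior."""
    agent_id: str
    behavior: str  # e.g. "cutting in from the left lane"

@dataclass
class ShortTermGoal:
    """Level 3: a short-horizon goal formulated before trajectory execution."""
    description: str

def build_planning_representation(scene: SceneContext,
                                  agents: List[AgentBehavior],
                                  goal: ShortTermGoal) -> str:
    """Fold the three cognitive levels into one compact textual
    representation that a VLM-based planner could condition on."""
    agent_lines = "; ".join(f"{a.agent_id}: {a.behavior}" for a in agents)
    return (f"Scene: {scene.description}\n"
            f"Key agents: {agent_lines}\n"
            f"Goal: {goal.description}")

# Example usage with a toy urban scenario.
rep = build_planning_representation(
    SceneContext("four-way intersection, light rain, moderate traffic"),
    [AgentBehavior("vehicle_3", "decelerating ahead in ego lane"),
     AgentBehavior("pedestrian_1", "waiting at the crosswalk")],
    ShortTermGoal("slow down and yield before proceeding straight"),
)
print(rep)
```

The point of the sketch is the ordering, which mirrors the human-cognition analogy in the abstract: the scene is summarized first, safety-critical agents are attended to second, and only then is a short-term goal formulated to condition trajectory planning.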