🤖 AI Summary
Existing general-purpose vision-language models lack the structured spatiotemporal representation capabilities required for autonomous driving, struggling to model geometric relationships, scene context, and motion patterns. This work proposes a hierarchical world cognition architecture that establishes a three-level cognitive hierarchy—scene, agent, and goal—and, for the first time, explicitly integrates human-like driving decision-making mechanisms into the representation learning of vision-language models. Building upon a pretrained vision-language foundation, the method jointly incorporates scene understanding, modeling of key agent behaviors, and short-term goal generation to enable driving-specific reasoning. Evaluated on the NAVSIM benchmark, the proposed approach achieves state-of-the-art performance among purely visual methods on both PDMS and EPDMS metrics.
📝 Abstract
Recent end-to-end autonomous driving approaches have leveraged Vision-Language Models (VLMs) to enhance planning capabilities in complex driving scenarios. However, VLMs are inherently trained as generalist models, lacking specialized understanding of driving-specific reasoning in 3D space and time. When applied to autonomous driving, these models struggle to establish structured spatiotemporal representations that capture geometric relationships, scene context, and motion patterns critical for safe trajectory planning. To address these limitations, we propose SGDrive, a novel framework that explicitly structures the VLM's representation learning around driving-specific knowledge hierarchies. Built upon a pre-trained VLM backbone, SGDrive decomposes driving understanding into a scene-agent-goal hierarchy that mirrors human driving cognition: drivers first perceive the overall environment (scene context), then attend to safety-critical agents and their behaviors, and finally formulate short-term goals before executing actions. This hierarchical decomposition provides the structured spatiotemporal representation that generalist VLMs lack, integrating multi-level information into a compact yet comprehensive format for trajectory planning. Extensive experiments on the NAVSIM benchmark demonstrate that SGDrive achieves state-of-the-art performance among camera-only methods on both PDMS and EPDMS, validating the effectiveness of hierarchical knowledge structuring for adapting generalist VLMs to autonomous driving.
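The scene-agent-goal decomposition described in the abstract can be illustrated with a minimal sketch. Everything below is a hypothetical reading of the three-level hierarchy, not the paper's actual interface: the class names (`SceneContext`, `AgentBehavior`, `ShortTermGoal`) and the idea of folding the levels into a compact text representation for a VLM-based planner are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical sketch of a scene-agent-goal hierarchy; all names and
# fields are illustrative assumptions, not SGDrive's actual API.

@dataclass
class SceneContext:
    """Level 1: overall environment, e.g. road layout and traffic state."""
    description: str

@dataclass
class AgentBehavior:
    """Level 2: a safety-critical agent and its observed/predicted behavior."""
    agent_id: str
    behavior: str  # e.g. "cutting in from the left lane"

@dataclass
class ShortTermGoal:
    """Level 3: a short-horizon goal formulated before trajectory execution."""
    description: str

def build_planning_representation(scene: SceneContext,
                                  agents: List[AgentBehavior],
                                  goal: ShortTermGoal) -> str:
    """Fold the three cognitive levels into one compact textual
    representation that a VLM-based planner could condition on."""
    agent_lines = "; ".join(f"{a.agent_id}: {a.behavior}" for a in agents)
    return (f"Scene: {scene.description}\n"
            f"Key agents: {agent_lines}\n"
            f"Goal: {goal.description}")

# Example usage with a toy urban scenario.
rep = build_planning_representation(
    SceneContext("four-way intersection, light rain, moderate traffic"),
    [AgentBehavior("vehicle_3", "decelerating ahead in ego lane"),
     AgentBehavior("pedestrian_1", "waiting at the crosswalk")],
    ShortTermGoal("slow down and yield before proceeding straight"),
)
print(rep)
```

The point of the sketch is the ordering, which mirrors the human-cognition analogy in the abstract: the scene is summarized first, safety-critical agents are attended to second, and only then is a short-term goal formulated to condition trajectory planning.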