SGDrive: Scene-to-Goal Hierarchical World Cognition for Autonomous Driving

📅 2026-01-09
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing general-purpose vision-language models lack the structured spatiotemporal representation capabilities required for autonomous driving, struggling to model geometric relationships, scene context, and motion patterns. This work proposes a hierarchical world cognition architecture that establishes a three-level cognitive hierarchy—scene, agent, and goal—and, for the first time, explicitly integrates human-like driving decision-making mechanisms into the representation learning of vision-language models. Building upon a pretrained vision-language foundation, the method jointly incorporates scene understanding, modeling of key agent behaviors, and short-term goal generation to enable driving-specific reasoning. Evaluated on the NAVSIM benchmark, the proposed approach achieves state-of-the-art performance among purely visual methods on both PDMS and EPDMS metrics.

📝 Abstract
Recent end-to-end autonomous driving approaches have leveraged Vision-Language Models (VLMs) to enhance planning capabilities in complex driving scenarios. However, VLMs are inherently trained as generalist models, lacking specialized understanding of driving-specific reasoning in 3D space and time. When applied to autonomous driving, these models struggle to establish structured spatial-temporal representations that capture geometric relationships, scene context, and motion patterns critical for safe trajectory planning. To address these limitations, we propose SGDrive, a novel framework that explicitly structures the VLM's representation learning around driving-specific knowledge hierarchies. Built upon a pre-trained VLM backbone, SGDrive decomposes driving understanding into a scene-agent-goal hierarchy that mirrors human driving cognition: drivers first perceive the overall environment (scene context), then attend to safety-critical agents and their behaviors, and finally formulate short-term goals before executing actions. This hierarchical decomposition provides the structured spatial-temporal representation that generalist VLMs lack, integrating multi-level information into a compact yet comprehensive format for trajectory planning. Extensive experiments on the NAVSIM benchmark demonstrate that SGDrive achieves state-of-the-art performance among camera-only methods on both PDMS and EPDMS, validating the effectiveness of hierarchical knowledge structuring for adapting generalist VLMs to autonomous driving.
Problem

Research questions and friction points this paper is trying to address.

autonomous driving
Vision-Language Models
spatial-temporal representation
trajectory planning
scene understanding
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hierarchical World Cognition
Vision-Language Models
Scene-to-Goal Decomposition
Autonomous Driving
Spatial-Temporal Representation
Jingyu Li
Fudan University, Shanghai Innovation Institute
Junjie Wu
Center for High Pressure Science & Technology Advanced Research
Physics
Dongnan Hu
Tongji University, Shanghai Innovation Institute
Xiangkai Huang
Li Auto Inc.
Bin Sun
Li Auto Inc.
Zhihui Hao
Li Auto Inc.
Xianpeng Lang
Li Auto Inc.
Xiatian Zhu
University of Surrey
Machine Learning, Computer Vision
Li Zhang
Professor, Fudan University & Shanghai Innovation Institute
computer vision, autonomous driving, world model, embodied AI