DrivePI: Spatial-aware 4D MLLM for Unified Autonomous Driving Understanding, Perception, Prediction and Planning

📅 2025-12-14
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing multimodal large language models (MLLMs) struggle to support fine-grained 3D perception and 4D dynamic prediction in autonomous driving. This paper introduces the first spatially aware 4D MLLM, unifying vision-language-action (VLA) understanding, 3D occupancy modeling, 4D occupancy flow prediction, and end-to-end motion planning. The contributions are threefold: (1) a novel 4D spatially aware MLLM architecture; (2) a text-occupancy/text-flow question-answering data engine; and (3) a unified single-model paradigm compatible with both VLA and vision-action (VA) tasks. The model jointly processes LiDAR point clouds, multi-view images, and natural language instructions, built upon a lightweight Qwen2.5-0.5B backbone and co-trained for 4D occupancy/flow modeling and action generation. On nuScenes-QA, it achieves 2.5% higher accuracy than OpenDriveVLA-7B; reduces collision rate by 70% versus ORION; and outperforms specialized models, including FB-OCC and VAD, across 3D occupancy, occupancy flow, and planning metrics.
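The summary describes one backbone that consumes LiDAR points, image features, and text tokens and emits occupancy and planning outputs in parallel. A minimal sketch of that token-fusion pattern follows; all module names, dimensions, and heads here are hypothetical illustrations, not the authors' implementation:

```python
import torch
import torch.nn as nn

class ToyDrivePI(nn.Module):
    """Toy sketch (not the paper's code): project each modality into a shared
    token space, concatenate, run a small transformer over the sequence, and
    decode occupancy and planning outputs from the same hidden states."""
    def __init__(self, d_model=64):
        super().__init__()
        self.lidar_proj = nn.Linear(3, d_model)      # (x, y, z) points -> tokens
        self.image_proj = nn.Linear(192, d_model)    # flattened patch features -> tokens
        self.text_embed = nn.Embedding(1000, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)
        self.occ_head = nn.Linear(d_model, 16)       # per-token occupancy logits
        self.plan_head = nn.Linear(d_model, 2)       # (x, y) waypoint per step

    def forward(self, points, patches, text_ids):
        tokens = torch.cat([
            self.lidar_proj(points),
            self.image_proj(patches),
            self.text_embed(text_ids),
        ], dim=1)
        h = self.backbone(tokens)
        # Decode 6 waypoints from the last 6 tokens (e.g., a 3 s horizon at 2 Hz).
        return self.occ_head(h), self.plan_head(h[:, -6:])

model = ToyDrivePI()
occ, plan = model(torch.randn(1, 32, 3),                 # 32 LiDAR points
                  torch.randn(1, 8, 192),                # 8 image patches
                  torch.randint(0, 1000, (1, 10)))       # 10 text tokens
```

The point of the sketch is the shared sequence: perception, prediction, and planning heads all read the same fused hidden states, which is what lets one model be optimized end-to-end for all tasks.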

📝 Abstract
Although multi-modal large language models (MLLMs) have shown strong capabilities across diverse domains, their application in generating fine-grained 3D perception and prediction outputs in autonomous driving remains underexplored. In this paper, we propose DrivePI, a novel spatial-aware 4D MLLM that serves as a unified Vision-Language-Action (VLA) framework that is also compatible with vision-action (VA) models. Our method jointly performs spatial understanding, 3D perception (i.e., 3D occupancy), prediction (i.e., occupancy flow), and planning (i.e., action outputs) in parallel through end-to-end optimization. To obtain both precise geometric information and rich visual appearance, our approach integrates point clouds, multi-view images, and language instructions within a unified MLLM architecture. We further develop a data engine to generate text-occupancy and text-flow QA pairs for 4D spatial understanding. Remarkably, with only a 0.5B Qwen2.5 model as MLLM backbone, DrivePI as a single unified model matches or exceeds both existing VLA models and specialized VA models. Specifically, compared to VLA models, DrivePI outperforms OpenDriveVLA-7B by 2.5% mean accuracy on nuScenes-QA and reduces collision rate by 70% over ORION (from 0.37% to 0.11%) on nuScenes. Against specialized VA models, DrivePI surpasses FB-OCC by 10.3 RayIoU for 3D occupancy on OpenOcc, reduces the mAVE from 0.591 to 0.509 for occupancy flow on OpenOcc, and achieves 32% lower L2 error than VAD (from 0.72m to 0.49m) for planning on nuScenes. Code will be available at https://github.com/happinesslz/DrivePI
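The abstract's planning comparison (0.72 m for VAD vs. 0.49 m for DrivePI) uses the standard open-loop L2 metric: the mean Euclidean distance between predicted and ground-truth bird's-eye-view waypoints over the future horizon. A minimal sketch of that computation (function name and toy trajectories are illustrative):

```python
import numpy as np

def planning_l2(pred, gt):
    """Mean L2 distance (metres) between predicted and ground-truth waypoints.
    pred, gt: arrays of shape (T, 2), one (x, y) position per future timestep."""
    return float(np.linalg.norm(pred - gt, axis=-1).mean())

# Toy 3-step trajectories: errors of 0.3 m, 0.4 m, and 0.0 m per step.
pred = np.array([[0.0, 1.0], [0.0, 2.0], [0.0, 3.0]])
gt   = np.array([[0.3, 1.0], [0.4, 2.0], [0.0, 3.0]])
err = planning_l2(pred, gt)  # (0.3 + 0.4 + 0.0) / 3
```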
Problem

Research questions and friction points this paper is trying to address.

Existing MLLMs struggle to produce fine-grained 3D perception and 4D dynamic prediction outputs for autonomous driving.
Unifying point clouds, multi-view images, and language instructions in a single model for spatial understanding and action generation remains an open challenge.
It is unclear whether a compact MLLM backbone can match specialized models in accuracy and safety.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unified 4D MLLM integrates point clouds, images, and language for autonomous driving.
End-to-end optimization jointly handles spatial understanding, perception, prediction, and planning.
Data engine generates text-occupancy and text-flow QA pairs for 4D spatial training.
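The paper summary does not detail the data engine, but the bullet above implies converting labeled occupancy grids into text QA pairs. A toy sketch under that assumption; the class table, question template, and function name are all hypothetical:

```python
# Hypothetical sketch of a text-occupancy QA data engine (not the authors' code):
# turn per-voxel semantic labels into simple question-answer training pairs.
from collections import Counter

CLASS_NAMES = {1: "car", 2: "pedestrian", 3: "barrier"}  # toy class table

def make_occupancy_qa(voxel_labels):
    """voxel_labels: iterable of per-voxel class ids (0 = free space)."""
    counts = Counter(label for label in voxel_labels if label in CLASS_NAMES)
    qa_pairs = []
    for cls_id, n in sorted(counts.items()):
        qa_pairs.append({
            "question": f"How many voxels are occupied by {CLASS_NAMES[cls_id]}?",
            "answer": str(n),
        })
    return qa_pairs

# 7 voxels: three "car", one "pedestrian", one "barrier", two free.
pairs = make_occupancy_qa([0, 1, 1, 2, 0, 3, 1])
```

A real engine would presumably also template questions about flow direction and future occupancy, but the pattern is the same: derive text supervision automatically from existing 4D annotations.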
👥 Authors
Zhe Liu (The University of Hong Kong)
Runhui Huang (The University of Hong Kong, Sun Yat-sen University)
Rui Yang (The University of Hong Kong)
Siming Yan (Yinwang Intelligent Technology Co. Ltd.)
Zining Wang (Beihang University)
Lu Hou (Yinwang Intelligent Technology Co. Ltd.)
Di Lin (Tianjin University)
Xiang Bai (Huazhong University of Science and Technology (HUST))
Hengshuang Zhao (The University of Hong Kong)