🤖 AI Summary
Existing multimodal large language models (MLLMs) struggle to support fine-grained 3D perception and 4D dynamic prediction in autonomous driving. This paper introduces DrivePI, a spatially aware 4D MLLM that unifies vision-language-action (VLA) understanding, 3D occupancy modeling, 4D occupancy flow prediction, and end-to-end motion planning. The contributions are threefold: (1) a novel 4D spatially aware MLLM architecture; (2) a data engine that generates text-occupancy and text-flow question-answering pairs; and (3) a unified single-model paradigm compatible with both VLA and vision-action (VA) tasks. Built on a lightweight Qwen2.5-0.5B backbone, the model jointly processes LiDAR point clouds, multi-view images, and natural language instructions, and is co-trained for 4D occupancy/flow modeling and action generation. On nuScenes-QA it achieves 2.5% higher mean accuracy than OpenDriveVLA-7B; it reduces the collision rate by 70% versus ORION (from 0.37% to 0.11%); and it outperforms specialized models, including FB-OCC and VAD, across 3D occupancy, occupancy flow, and planning metrics.
📝 Abstract
Although multi-modal large language models (MLLMs) have shown strong capabilities across diverse domains, their application to generating fine-grained 3D perception and prediction outputs in autonomous driving remains underexplored. In this paper, we propose DrivePI, a novel spatially aware 4D MLLM that serves as a unified Vision-Language-Action (VLA) framework and is also compatible with vision-action (VA) models. Our method jointly performs spatial understanding, 3D perception (i.e., 3D occupancy), prediction (i.e., occupancy flow), and planning (i.e., action outputs) in parallel through end-to-end optimization. To obtain both precise geometric information and rich visual appearance, our approach integrates point clouds, multi-view images, and language instructions within a unified MLLM architecture. We further develop a data engine to generate text-occupancy and text-flow QA pairs for 4D spatial understanding. Remarkably, with only a 0.5B Qwen2.5 model as the MLLM backbone, DrivePI, as a single unified model, matches or exceeds both existing VLA models and specialized VA models. Specifically, compared to VLA models, DrivePI outperforms OpenDriveVLA-7B by 2.5% in mean accuracy on nuScenes-QA and reduces the collision rate by 70% over ORION (from 0.37% to 0.11%) on nuScenes. Against specialized VA models, DrivePI surpasses FB-OCC by 10.3 RayIoU for 3D occupancy on OpenOcc, reduces mAVE from 0.591 to 0.509 for occupancy flow on OpenOcc, and achieves 32% lower L2 error than VAD (from 0.72m to 0.49m) for planning on nuScenes. Code will be available at https://github.com/happinesslz/DrivePI.
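The unified input design described above, where point clouds, multi-view images, and language instructions are integrated into a single MLLM sequence, can be sketched schematically. The sketch below is purely illustrative: the encoder functions, token counts, and embedding width are hypothetical placeholders, not the paper's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 64  # hypothetical shared embedding width

def encode_lidar(points):
    # Stand-in for a point-cloud encoder: pools N raw points into 16 tokens.
    return rng.standard_normal((16, D))

def encode_images(views):
    # Stand-in for an image encoder: 32 tokens per camera view.
    return rng.standard_normal((32 * len(views), D))

def encode_text(prompt):
    # Stand-in for the LLM tokenizer/embedder: one token per word.
    return rng.standard_normal((len(prompt.split()), D))

# All modality tokens are concatenated into one sequence; in the paper this
# sequence would feed the Qwen2.5-0.5B backbone, whose outputs are decoded
# in parallel into occupancy, flow, and action heads.
lidar  = encode_lidar(rng.standard_normal((1000, 3)))
images = encode_images([None] * 6)  # six surround-view cameras
text   = encode_text("drive straight and avoid the pedestrian")

tokens = np.concatenate([lidar, images, text], axis=0)
print(tokens.shape)  # → (214, 64): 16 lidar + 192 image + 6 text tokens
```

The point of the sketch is only the fusion pattern: each modality is projected into a common token space so a single backbone can attend across geometry, appearance, and language jointly.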