🤖 AI Summary
Existing multimodal large language models (MLLMs) struggle to support fine-grained 3D perception and 4D dynamic prediction in autonomous driving. This paper introduces DrivePI, a spatially aware 4D MLLM that unifies vision-language-action (VLA) understanding, 3D occupancy modeling, 4D occupancy flow prediction, and end-to-end motion planning. The contributions are threefold: (1) a novel 4D spatially aware MLLM architecture; (2) a data engine that generates text-occupancy and text-flow question-answering pairs; and (3) a unified single-model paradigm compatible with both VLA and vision-action (VA) tasks. Built on a lightweight Qwen2.5-0.5B backbone, the model jointly processes LiDAR point clouds, multi-view images, and natural language instructions, and is co-trained for 4D occupancy/flow modeling and action generation. On nuScenes-QA it achieves 2.5% higher mean accuracy than OpenDriveVLA-7B; it reduces the collision rate by 70% versus ORION (from 0.37% to 0.11%); and it outperforms specialized models, including FB-OCC and VAD, across 3D occupancy, occupancy flow, and planning metrics.
📝 Abstract
Although multi-modal large language models (MLLMs) have shown strong capabilities across diverse domains, their application to generating fine-grained 3D perception and prediction outputs in autonomous driving remains underexplored. In this paper, we propose DrivePI, a novel spatially aware 4D MLLM that serves as a unified Vision-Language-Action (VLA) framework and is also compatible with vision-action (VA) models. Our method jointly performs spatial understanding, 3D perception (i.e., 3D occupancy), prediction (i.e., occupancy flow), and planning (i.e., action outputs) in parallel through end-to-end optimization. To obtain both precise geometric information and rich visual appearance, our approach integrates point clouds, multi-view images, and language instructions within a unified MLLM architecture. We further develop a data engine to generate text-occupancy and text-flow QA pairs for 4D spatial understanding. Remarkably, with only a 0.5B Qwen2.5 model as the MLLM backbone, DrivePI, as a single unified model, matches or exceeds both existing VLA models and specialized VA models. Specifically, compared to VLA models, DrivePI outperforms OpenDriveVLA-7B by 2.5% in mean accuracy on nuScenes-QA and reduces the collision rate by 70% over ORION (from 0.37% to 0.11%) on nuScenes. Against specialized VA models, DrivePI surpasses FB-OCC by 10.3 RayIoU for 3D occupancy on OpenOcc, reduces mAVE from 0.591 to 0.509 for occupancy flow on OpenOcc, and achieves 32% lower L2 error than VAD (from 0.72m to 0.49m) for planning on nuScenes. Code will be available at https://github.com/happinesslz/DrivePI.
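The unified input design described above, where point clouds, multi-view images, and language instructions are integrated into a single MLLM sequence, can be sketched schematically. The sketch below is purely illustrative: the encoder functions, token counts, and embedding width are hypothetical placeholders, not the paper's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 64  # hypothetical shared embedding width

def encode_lidar(points):
    # Stand-in for a point-cloud encoder: pools N raw points into 16 tokens.
    return rng.standard_normal((16, D))

def encode_images(views):
    # Stand-in for an image encoder: 32 tokens per camera view.
    return rng.standard_normal((32 * len(views), D))

def encode_text(prompt):
    # Stand-in for the LLM tokenizer/embedder: one token per word.
    return rng.standard_normal((len(prompt.split()), D))

# All modality tokens are concatenated into one sequence; in the paper this
# sequence would feed the Qwen2.5-0.5B backbone, whose outputs are decoded
# in parallel into occupancy, flow, and action heads.
lidar  = encode_lidar(rng.standard_normal((1000, 3)))
images = encode_images([None] * 6)  # six surround-view cameras
text   = encode_text("drive straight and avoid the pedestrian")

tokens = np.concatenate([lidar, images, text], axis=0)
print(tokens.shape)  # → (214, 64): 16 lidar + 192 image + 6 text tokens
```

The point of the sketch is only the fusion pattern: each modality is projected into a common token space so a single backbone can attend across geometry, appearance, and language jointly.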