OccVLA: Vision-Language-Action Model with Implicit 3D Occupancy Supervision

📅 2025-09-05
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current multimodal large language models (MLLMs) face significant bottlenecks in 3D spatial understanding required for autonomous driving: (1) high cost and difficulty in constructing effective 3D representations from 2D inputs, and (2) absence of large-scale 3D vision–language pretraining, leading to loss of fine-grained spatial information. To address this, we propose an implicit 3D occupancy supervision framework that operates solely on 2D images and jointly learns spatial structure and language-to-action mapping within an MLLM. Crucially, dense 3D occupancy is introduced as an intermediate, training-only supervision signal—omitted during inference with zero performance degradation—ensuring efficiency, interpretability, and pure-vision scalability. On the nuScenes trajectory planning benchmark, our method achieves state-of-the-art performance; it also significantly outperforms existing approaches on 3D visual question answering, demonstrating superior spatial reasoning and multimodal fusion capabilities.

📝 Abstract
Multimodal large language models (MLLMs) have shown strong vision-language reasoning abilities but still lack robust 3D spatial understanding, which is critical for autonomous driving. This limitation stems from two key challenges: (1) the difficulty of constructing accessible yet effective 3D representations without expensive manual annotations, and (2) the loss of fine-grained spatial details in VLMs due to the absence of large-scale 3D vision-language pretraining. To address these challenges, we propose OccVLA, a novel framework that integrates 3D occupancy representations into a unified multimodal reasoning process. Unlike prior approaches that rely on explicit 3D inputs, OccVLA treats dense 3D occupancy as both a predictive output and a supervisory signal, enabling the model to learn fine-grained spatial structures directly from 2D visual inputs. The occupancy predictions are regarded as implicit reasoning processes and can be skipped during inference without performance degradation, thereby adding no extra computational overhead. OccVLA achieves state-of-the-art results on the nuScenes benchmark for trajectory planning and demonstrates superior performance on 3D visual question-answering tasks, offering a scalable, interpretable, and fully vision-based solution for autonomous driving.
Problem

Research questions and friction points this paper is trying to address.

MLLMs lack robust 3D spatial understanding required for autonomous driving
Difficulty of constructing accessible 3D representations without expensive manual annotations
Loss of fine-grained spatial details in vision-language models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Implicit 3D occupancy supervision from 2D inputs
Unified multimodal reasoning with 3D representations
No extra computation during inference phase
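The "no extra computation at inference" claim above follows from treating the occupancy head as a training-only auxiliary branch: it is attached to the shared visual features to provide a supervision signal, then simply skipped at deployment. A minimal sketch of that pattern (all class and method names here are illustrative, not from the paper's code):

```python
# Hypothetical sketch of training-only auxiliary occupancy supervision.
# OccAuxModel, occ_head, action_head, etc. are illustrative names; the
# real model is a multimodal LLM, not these toy functions.

class OccAuxModel:
    def __init__(self):
        self.occ_head_calls = 0  # track whether the auxiliary head ran

    def backbone(self, image):
        # stand-in for the shared 2D visual encoder
        return [p * 0.5 for p in image]

    def occ_head(self, feats):
        # auxiliary head predicting dense occupancy (used only in training)
        self.occ_head_calls += 1
        return [1.0 if f > 0.2 else 0.0 for f in feats]

    def action_head(self, feats):
        # stand-in for the language-to-action / trajectory output
        return sum(feats) / len(feats)

    def forward(self, image, occ_target=None, training=False):
        feats = self.backbone(image)
        action = self.action_head(feats)
        occ_loss = None
        if training and occ_target is not None:
            # occupancy prediction acts as implicit intermediate reasoning:
            # it shapes the shared features via this loss, nothing more
            pred = self.occ_head(feats)
            occ_loss = sum((p - t) ** 2 for p, t in zip(pred, occ_target)) / len(pred)
        return action, occ_loss

model = OccAuxModel()
# training step: occupancy head active, contributes an auxiliary loss
_, loss = model.forward([0.2, 0.8, 0.4], occ_target=[0.0, 1.0, 0.0], training=True)
# inference step: occupancy head skipped entirely, so no added cost
action, no_loss = model.forward([0.2, 0.8, 0.4], training=False)
```

Because the gradient from the occupancy loss only shapes the shared backbone features during training, dropping the head at inference changes cost but not the learned representation.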
Ruixun Liu
Undergraduate, Xi'an Jiaotong University
computer vision
Lingyu Kong
Shanghai Qi Zhi Institute, Fudan University
Derun Li
Shanghai Jiao Tong University
Hang Zhao
Shanghai Qi Zhi Institute, Tsinghua University