MMDrive: Interactive Scene Understanding Beyond Vision with Multi-representational Fusion

📅 2025-12-15

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing vision-language models are constrained by 2D image understanding paradigms, which limits their ability to model 3D spatial structure and achieve deep cross-modal semantic fusion, and in turn hinders autonomous driving perception in complex scenes. To address this, MMDrive proposes a 3D scene understanding framework that jointly models occupancy maps, LiDAR point clouds, and textual scene descriptions. It introduces a Text-oriented Multimodal Modulator, which weights each modality's contribution according to the semantic cues in the question, and a Cross-Modal Abstractor, which uses learnable abstract tokens to produce compact, task-relevant cross-modal summaries. On DriveLM, the method achieves a BLEU-4 of 54.56 and a METEOR of 41.78; on NuScenes-QA, it attains 62.7% accuracy, which the authors report as a significant gain over existing vision-language models for autonomous driving.

📝 Abstract

Vision-language models enable understanding of and reasoning about complex traffic scenarios through multi-source information fusion, establishing them as a core technology for autonomous driving. However, existing vision-language models are constrained by the 2D image understanding paradigm, which restricts their capability to perceive 3D spatial information and perform deep semantic fusion, resulting in suboptimal performance in complex autonomous driving environments. This study proposes MMDrive, a multimodal vision-language model framework that extends traditional image understanding to generalized 3D scene understanding. MMDrive incorporates three complementary modalities: occupancy maps, LiDAR point clouds, and textual scene descriptions. To fuse them, it introduces two novel components for adaptive cross-modal fusion and key information extraction. Specifically, the Text-oriented Multimodal Modulator dynamically weights the contribution of each modality based on the semantic cues in the question, guiding context-aware feature integration. The Cross-Modal Abstractor employs learnable abstract tokens to generate compact, cross-modal summaries that highlight key regions and essential semantics. Comprehensive evaluations on the DriveLM and NuScenes-QA benchmarks demonstrate that MMDrive achieves significant performance gains over existing vision-language models for autonomous driving, with a BLEU-4 score of 54.56 and a METEOR score of 41.78 on DriveLM, and an accuracy of 62.7% on NuScenes-QA. MMDrive effectively breaks the traditional image-only understanding barrier, enabling robust multimodal reasoning in complex driving environments and providing a new foundation for interpretable autonomous driving scene understanding.

Problem

Research questions and friction points this paper is trying to address.

Extends 2D image understanding to generalized 3D scene understanding for autonomous driving

Fuses occupancy maps, LiDAR point clouds, and text for robust multimodal reasoning

Enables adaptive cross-modal fusion and key information extraction in complex driving environments

Innovation

Methods, ideas, or system contributions that make the work stand out.

Extends 2D vision to 3D scene understanding with multimodal fusion.

Uses adaptive cross-modal fusion via Text-oriented Multimodal Modulator.

Employs Cross-Modal Abstractor for compact key semantic summaries.
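
The two components above can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the abstract does not specify the architecture, so everything here is an assumption made for illustration, including the function names `text_guided_modulation` and `cross_modal_abstract`, the mean-pooled relevance score, the identity projections, the single-head attention, and the feature sizes. It only shows the general idea of question-conditioned modality weights followed by a small set of learnable tokens that summarize the fused features.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def text_guided_modulation(question_emb, modality_feats, proj):
    """Hypothetical modulator: weight each modality by its relevance to the question.

    question_emb:   (d,) embedding of the question text
    modality_feats: dict name -> (n_i, d) feature tokens for that modality
    proj:           dict name -> (d, d) per-modality projection (assumed, not from the paper)
    """
    names = list(modality_feats)
    d = question_emb.shape[0]
    # One relevance score per modality: question vs. mean-pooled, projected features.
    scores = np.array([
        question_emb @ (modality_feats[n].mean(axis=0) @ proj[n]) for n in names
    ])
    weights = softmax(scores / np.sqrt(d))
    # Scale each modality's tokens by its weight, then concatenate along the token axis.
    fused = np.concatenate(
        [w * modality_feats[n] for w, n in zip(weights, names)], axis=0
    )
    return fused, dict(zip(names, weights))

def cross_modal_abstract(abstract_tokens, fused):
    """Hypothetical abstractor: k learnable tokens attend over the fused features."""
    d = fused.shape[1]
    attn = softmax(abstract_tokens @ fused.T / np.sqrt(d), axis=-1)  # (k, n_total)
    return attn @ fused  # (k, d) compact summary

# Toy inputs; random values stand in for real encoder outputs.
rng = np.random.default_rng(0)
d = 16
feats = {
    "occupancy": rng.normal(size=(32, d)),
    "lidar": rng.normal(size=(64, d)),
    "text": rng.normal(size=(8, d)),
}
proj = {name: np.eye(d) for name in feats}   # identity projections for simplicity
question = rng.normal(size=d)
tokens = rng.normal(size=(4, d))             # would be learned parameters in training

fused, weights = text_guided_modulation(question, feats, proj)
summary = cross_modal_abstract(tokens, fused)
```

In this sketch the 104 input tokens (32 + 64 + 8) are reduced to 4 summary tokens, which is the sense in which the abstractor output is "compact"; in a trained model the abstract tokens and projections would be learned rather than sampled.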
