Tracking Meets Large Multimodal Models for Driving Scenario Understanding

📅 2025-03-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the limitation of large multimodal models (LMMs) in autonomous driving—specifically, their inability to model 3D spatiotemporal dynamics due to exclusive reliance on 2D images—this work introduces structured 3D object tracking information as a spatiotemporal prior into LMMs for the first time. We propose a lightweight track encoder and a vision–tracking cross-modal fusion mechanism, coupled with a self-supervised pretraining strategy that enables efficient spatiotemporal representation learning without additional video or 3D computation overhead. On DriveLM-nuScenes, our method achieves a 9.5% absolute accuracy gain, a +7.04-point improvement in ChatGPT-based evaluation, and a 9.4% overall performance boost; on DriveLM-CARLA, it yields a 3.7% increase in final task score. This work establishes a scalable, spatiotemporally aware paradigm for endowing LMMs with embodied intelligent driving capabilities.

📝 Abstract
Large Multimodal Models (LMMs) have recently gained prominence in autonomous driving research, showcasing promising capabilities across various emerging benchmarks. LMMs specifically designed for this domain have demonstrated effective perception, planning, and prediction skills. However, many of these methods underutilize 3D spatial and temporal elements, relying mainly on image data. As a result, their effectiveness in dynamic driving environments is limited. We propose to integrate tracking information as an additional input to recover 3D spatial and temporal details that are not effectively captured in the images. We introduce a novel approach for embedding this tracking information into LMMs to enhance their spatiotemporal understanding of driving scenarios. By incorporating 3D tracking data through a track encoder, we enrich visual queries with crucial spatial and temporal cues while avoiding the computational overhead associated with processing lengthy video sequences or extensive 3D inputs. Moreover, we employ a self-supervised approach to pretrain the tracking encoder to provide LMMs with additional contextual information, significantly improving their performance in perception, planning, and prediction tasks for autonomous driving. Experimental results demonstrate the effectiveness of our approach, with a gain of 9.5% in accuracy, an increase of 7.04 points in the ChatGPT score, and a 9.4% increase in the overall score over baseline models on the DriveLM-nuScenes benchmark, along with a 3.7% final score improvement on DriveLM-CARLA. Our code is available at https://github.com/mbzuai-oryx/TrackingMeetsLMM
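The fusion described in the abstract can be illustrated with a minimal numpy sketch: per-frame 3D boxes of each tracked object are embedded and temporally pooled by a small track encoder, and visual queries then attend to the resulting track embeddings via cross-attention. All shapes, weights, and the box parameterization here are illustrative assumptions, not the paper's actual architecture.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
d = 32  # embedding dimension (illustrative)

def encode_track(track, W1, W2):
    # track: (T, 7) per-frame 3D boxes (x, y, z, w, l, h, yaw) -- assumed format
    h = np.tanh(track @ W1)       # per-frame embedding, shape (T, d)
    return h.mean(axis=0) @ W2    # temporal mean-pooling -> single (d,) track embedding

# Toy random weights; a real track encoder would learn these
# (e.g. via the paper's self-supervised pretraining).
W1, W2 = rng.normal(size=(7, d)), rng.normal(size=(d, d))

# 3 tracked objects observed over 5 frames each
tracks = rng.normal(size=(3, 5, 7))
track_emb = np.stack([encode_track(t, W1, W2) for t in tracks])  # (3, d)

# Vision-tracking cross-modal fusion: 4 visual queries attend to track embeddings
queries = rng.normal(size=(4, d))
attn = softmax(queries @ track_emb.T / np.sqrt(d))  # (4, 3) attention weights
fused = queries + attn @ track_emb                  # residual fusion, (4, d)
```

Because the tracks are pooled into one embedding per object, the LMM receives spatiotemporal cues without ingesting full video or raw 3D inputs, which is the efficiency argument the abstract makes.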
Problem

Research questions and friction points this paper is trying to address.

Enhance spatiotemporal understanding in autonomous driving scenarios
Integrate 3D tracking data to improve perception and prediction
Address limitations of image-only data in dynamic environments
Innovation

Methods, ideas, or system contributions that make the work stand out.

Integrates tracking data into Large Multimodal Models
Uses track encoder for 3D spatial-temporal enhancement
Self-supervised pretraining improves autonomous driving tasks
Ayesha Ishaq
Mohamed Bin Zayed University of Artificial Intelligence (MBZUAI)
Jean Lahoud
Mohamed bin Zayed University of Artificial Intelligence (MBZUAI)
Computer Vision
Fahad Shahbaz Khan
MBZUAI; Linköping University, Sweden
Computer Vision · Object Recognition · Generative AI · AI for Science
Salman Khan
Mohamed Bin Zayed University of Artificial Intelligence (MBZUAI), Australian National University
Hisham Cholakkal
Mohamed bin Zayed University of Artificial Intelligence (MBZUAI)
Computer Vision · Large Multimodal Models · LLM · Healthcare Foundation Model · Conversational Assistant
R. Anwer
Mohamed Bin Zayed University of Artificial Intelligence (MBZUAI)