AI Summary
To address the ambiguity, strong contextual dependency, and the challenges of modeling 3D spatial relations and dynamic scene evolution inherent in natural-language instructions for autonomous driving, this paper proposes ThinkDeeper, a novel framework that introduces world models into vision-language grounding. It constructs a Spatial-Aware World Model (SA-WM) and integrates a hypergraph decoder so that localization decisions are informed by reasoning over predicted future states. Methodologically, it unifies instruction-aware latent state modeling, forward sequence prediction, and hypergraph-guided hierarchical multimodal fusion, and it leverages RAG-augmented LLMs to generate high-quality semantic annotations. Evaluated on multiple benchmarks, ThinkDeeper achieves state-of-the-art performance: first place on Talk2Car and consistent gains over prior methods on DrivePilot, MoCAD, and the RefCOCO variants. Notably, it remains the top performer even when trained on only 50% of the standard training data, demonstrating strong robustness and generalization.
Abstract
Interpreting natural-language commands to localize target objects is critical for autonomous driving (AD). Existing visual grounding (VG) methods for autonomous vehicles (AVs) typically struggle with ambiguous, context-dependent instructions, as they lack reasoning over 3D spatial relations and anticipated scene evolution. Grounded in the principles of world models, we propose ThinkDeeper, a framework that reasons about future spatial states before making grounding decisions. At its core is a Spatial-Aware World Model (SA-WM) that learns to reason ahead by distilling the current scene into a command-aware latent state and rolling out a sequence of future latent states, providing forward-looking cues for disambiguation. Complementing this, a hypergraph-guided decoder hierarchically fuses these states with the multimodal input, capturing higher-order spatial dependencies for robust localization. In addition, we present DrivePilot, a multi-source VG dataset for AD, featuring semantic annotations generated by a Retrieval-Augmented Generation (RAG) and Chain-of-Thought (CoT)-prompted LLM pipeline. In extensive evaluations on six benchmarks, ThinkDeeper ranks #1 on the Talk2Car leaderboard and surpasses state-of-the-art baselines on DrivePilot, MoCAD, and RefCOCO/+/g. Notably, it shows strong robustness and efficiency in challenging scenes (long-text, multi-agent, ambiguity) and retains superior performance even when trained on 50% of the data.
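The abstract's core mechanism (distilling the scene and command into one latent state, then autoregressively rolling out future latent states before grounding) can be sketched at a high level. This is a minimal, hypothetical NumPy illustration of the latent-rollout idea only, not the authors' SA-WM implementation; all names (`encode_state`, `rollout`, the weight matrices) and the simple tanh dynamics are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode_state(scene_feat, command_feat, W_s, W_c):
    # Distill the current scene and the language command into a single
    # command-aware latent state (hypothetical stand-in for SA-WM encoding).
    return np.tanh(scene_feat @ W_s + command_feat @ W_c)

def rollout(z0, W_f, horizon=3):
    # Autoregressively predict a sequence of future latent states; a grounding
    # head would consume this sequence as forward-looking disambiguation cues.
    states, z = [z0], z0
    for _ in range(horizon):
        z = np.tanh(z @ W_f)
        states.append(z)
    return np.stack(states)

d = 8  # toy latent dimension
W_s, W_c, W_f = (rng.standard_normal((d, d)) * 0.1 for _ in range(3))
scene, cmd = rng.standard_normal(d), rng.standard_normal(d)

z0 = encode_state(scene, cmd, W_s, W_c)
future = rollout(z0, W_f, horizon=3)  # shape (4, d): current + 3 future states
```

In the paper's framework these predicted states are then fused hierarchically with the multimodal input by the hypergraph-guided decoder; the sketch stops at producing the state sequence.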