AI Summary
To address the ambiguity, strong contextual dependency, and the challenges of modeling 3D spatial relations and dynamic scene evolution inherent in natural-language instructions for autonomous driving, this paper proposes ThinkDeeper, a novel framework that introduces world models into vision-language grounding. It constructs a Spatial-Aware World Model (SA-WM) and integrates a hypergraph decoder so that localization decisions are informed by reasoning over predicted future states. Methodologically, it unifies instruction-aware latent state modeling, forward sequence prediction, and hypergraph-guided hierarchical multimodal fusion, and it leverages RAG-augmented LLMs to generate high-quality semantic annotations. Evaluated on multiple benchmarks, ThinkDeeper achieves state-of-the-art performance: first place on Talk2Car and consistent gains over prior methods on DrivePilot, MoCAD, and the RefCOCO variants. Notably, it remains the top performer even when trained on only 50% of the standard training data, demonstrating strong robustness and generalization.
Abstract
Interpreting natural-language commands to localize target objects is critical for autonomous driving (AD). Existing visual grounding (VG) methods for autonomous vehicles (AVs) typically struggle with ambiguous, context-dependent instructions, as they lack reasoning over 3D spatial relations and anticipated scene evolution. Grounded in the principles of world models, we propose ThinkDeeper, a framework that reasons about future spatial states before making grounding decisions. At its core is a Spatial-Aware World Model (SA-WM) that learns to reason ahead by distilling the current scene into a command-aware latent state and rolling out a sequence of future latent states, providing forward-looking cues for disambiguation. Complementing this, a hypergraph-guided decoder hierarchically fuses these states with the multimodal input, capturing higher-order spatial dependencies for robust localization. In addition, we present DrivePilot, a multi-source VG dataset for AD, featuring semantic annotations generated by a Retrieval-Augmented Generation (RAG) and Chain-of-Thought (CoT)-prompted LLM pipeline. In extensive evaluations on six benchmarks, ThinkDeeper ranks #1 on the Talk2Car leaderboard and surpasses state-of-the-art baselines on DrivePilot, MoCAD, and RefCOCO/+/g. Notably, it shows strong robustness and efficiency in challenging scenes (long-text, multi-agent, ambiguity) and retains superior performance even when trained on 50% of the data.
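The abstract's core mechanism (distilling the scene and command into one latent state, then autoregressively rolling out future latent states before grounding) can be sketched at a high level. This is a minimal, hypothetical NumPy illustration of the latent-rollout idea only, not the authors' SA-WM implementation; all names (`encode_state`, `rollout`, the weight matrices) and the simple tanh dynamics are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode_state(scene_feat, command_feat, W_s, W_c):
    # Distill the current scene and the language command into a single
    # command-aware latent state (hypothetical stand-in for SA-WM encoding).
    return np.tanh(scene_feat @ W_s + command_feat @ W_c)

def rollout(z0, W_f, horizon=3):
    # Autoregressively predict a sequence of future latent states; a grounding
    # head would consume this sequence as forward-looking disambiguation cues.
    states, z = [z0], z0
    for _ in range(horizon):
        z = np.tanh(z @ W_f)
        states.append(z)
    return np.stack(states)

d = 8  # toy latent dimension
W_s, W_c, W_f = (rng.standard_normal((d, d)) * 0.1 for _ in range(3))
scene, cmd = rng.standard_normal(d), rng.standard_normal(d)

z0 = encode_state(scene, cmd, W_s, W_c)
future = rollout(z0, W_f, horizon=3)  # shape (4, d): current + 3 future states
```

In the paper's framework these predicted states are then fused hierarchically with the multimodal input by the hypergraph-guided decoder; the sketch stops at producing the state sequence.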