AI Summary
Current large-model-based approaches to autonomous driving scene understanding and planning lack effective temporal modeling, leading to inconsistent reasoning over sequential actions and compromising both safety and interpretability. To address this, this work proposes three multi-agent planner architectures incorporating varying degrees of temporal conditioning constraints. The authors establish the first empirical benchmark for temporally aware scene-to-planning reasoning on a subset of BDD-X and introduce evaluation metrics assessing semantic, syntactic, and logical consistency. Experimental results show that while explicit temporal constraints do not significantly improve standard NLP metrics, qualitative analysis reveals their capacity to elicit forward-looking risk assessment, stabilize corrective behaviors, and enhance strategic diversity. The study also highlights limitations in current prompt engineering practices regarding temporal grounding.
Abstract
Recent attempts to support high-level scene interpretation and planning in Autonomous Vehicles (AVs) using ensembles of Large Language Models (LLMs) and Large Multimodal Models (LMMs) continue to treat time as a secondary property. This lack of temporal grounding leads to inconsistencies in reasoning about continuous actions, undermining both safety and interpretability. This work explores whether temporal conditioning within inter-agent communication can preserve or enhance coherence without introducing degradation in semantic or logical consistency. To investigate this, we introduce three planner architectures with progressively increasing temporal integration and evaluate them on curated subsets of the BDD-X dataset using semantic, syntactic, and logical metrics. Results show that while temporal conditioning reshapes reasoning style, it yields no statistically significant improvements in standard NLP-based correctness metrics. However, qualitative analysis reveals predictive hazard reasoning, stable corrective behavior, and strategic divergence in the Sentinel. These findings clarify the limits of prompt-based temporal grounding and establish the first empirical benchmark for temporal scene-to-plan reasoning.