From Segments to Scenes: Temporal Understanding in Autonomous Driving via Vision-Language Model

📅 2025-12-04
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current vision-language models (VLMs) exhibit severe limitations in understanding temporal dynamics for autonomous driving, and no dedicated benchmark exists for evaluating temporal reasoning on ego-centric driving videos. To address this gap, we introduce TAD—a novel, fine-grained temporal understanding benchmark specifically designed for autonomous driving—comprising nearly 6,000 question-answer pairs across seven distinct temporal reasoning tasks. To overcome the weak temporal modeling capability of existing VLMs, we propose two training-free, plug-and-play methods: Scene-CoT, which enables ego-centric chain-of-thought reasoning, and TCogMap, a spatiotemporal cognitive map framework integrating fine-grained motion perception with contextual temporal inference. Evaluated on TAD, our methods boost the average accuracy of mainstream VLMs by up to 17.72%, significantly advancing both the evaluation methodology and the research frontier in temporal video understanding for autonomous driving.


📝 Abstract
Temporal understanding in autonomous driving (AD) remains a significant challenge, even for recent state-of-the-art (SoTA) Vision-Language Models (VLMs). Prior work has introduced datasets and benchmarks aimed at improving temporal reasoning, but these emphasize other video content, such as sports, cooking, and movies. No existing benchmark focuses exclusively on the unique challenges of temporal understanding in ego-centric AD footage. To fill this gap, the Temporal Understanding in Autonomous Driving (TAD) benchmark is presented, which evaluates VLMs' ability to capture the dynamic relationships between actions in AD. TAD comprises nearly 6,000 question-answer (QA) pairs spanning 7 human-designed tasks. In addition, an evaluation is performed on 9 closed- and open-source generalist models as well as SoTA AD specialist models. On TAD, current SoTA models achieve substandard accuracy, largely due to imperfect fine-grained motion understanding. To improve motion understanding and overall accuracy on TAD, two novel training-free solutions are proposed: Scene-CoT, which leverages Chain-of-Thought (CoT) reasoning, and TCogMap, which incorporates an ego-centric temporal cognitive map. The proposed approaches integrate with existing VLMs and improve average accuracy on TAD by up to 17.72%. By introducing TAD, benchmarking multiple SoTA models, and proposing effective enhancements, this work aims to catalyze future research on temporal understanding in AD. The benchmark and evaluation code are available at https://huggingface.co/datasets/vbdai/TAD and https://github.com/vbdi/tad_bench, respectively.
Problem

Research questions and friction points this paper is trying to address.

Introduces a benchmark for temporal understanding in autonomous driving videos
Evaluates vision-language models on dynamic action relationships in driving scenes
Proposes training-free methods to improve motion understanding and accuracy
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces TAD benchmark for autonomous driving temporal understanding
Proposes Scene-CoT using Chain-of-Thought for motion reasoning
Develops TCogMap with ego-centric temporal cognitive maps
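The page names Scene-CoT only at a high level (training-free, chain-of-thought prompting for ego-centric motion reasoning), without specifying its prompt structure. The sketch below is a minimal, hypothetical illustration of what such a training-free CoT prompt builder could look like; `build_scene_cot_prompt` and the four reasoning steps are assumptions for illustration, not the authors' actual implementation.

```python
# Hypothetical sketch of a chain-of-thought prompt builder for temporal QA
# over ego-centric driving-video frames, in the spirit of a training-free
# Scene-CoT-style approach. All names and steps here are illustrative.

def build_scene_cot_prompt(question: str, num_frames: int) -> str:
    """Assemble a prompt that asks a VLM to reason step by step about
    ego-centric motion across sampled video frames before answering."""
    steps = [
        "1. Describe the ego vehicle's motion across the frames.",
        "2. Identify other agents and how their positions change over time.",
        "3. Order the key events chronologically.",
        "4. Answer the question using the reasoning above.",
    ]
    header = (
        f"You are given {num_frames} frames sampled from an ego-centric "
        "driving video, in temporal order."
    )
    return "\n".join([header, "Think step by step:", *steps,
                      f"Question: {question}"])

prompt = build_scene_cot_prompt(
    "Did the pedestrian cross before the ego car turned?", 8)
print(prompt)
```

Because the approach is plug-and-play, a prompt like this would simply be passed, together with the sampled frames, to any off-the-shelf VLM's text input, with no fine-tuning involved.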