Robust Driving QA through Metadata-Grounded Context and Task-Specific Prompts

📅 2025-10-21
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the robustness deficiency of high-level autonomous driving question answering (encompassing perception, prediction, and planning) under visual degradation, this paper proposes a two-stage multimodal question-answering framework. Methodologically, it builds on the Qwen2.5-VL-32B foundation model, integrating inputs from six surround-view cameras, historical frame sequences, and nuScenes scene metadata. The framework employs task-specific few-shot chain-of-thought prompting, augmented with context-aware instructions and a self-consistency-driven reasoning-chain integration mechanism. The core contributions are (1) metadata-enhanced contextual modeling and (2) multi-path reasoning consistency verification. Evaluated on a dedicated driving QA benchmark, the approach achieves 67.37% overall accuracy, substantially outperforming baselines, and maintains 96% accuracy under severe visual perturbations, demonstrating significant improvements in both robustness and answer reliability.

📝 Abstract
We present a two-phase vision-language QA system for autonomous driving that answers high-level perception, prediction, and planning questions. In Phase-1, a large vision-language model (Qwen2.5-VL-32B) is conditioned on six-camera inputs, a short temporal window of history frames, and a chain-of-thought prompt with few-shot exemplars. A self-consistency ensemble (multiple sampled reasoning chains) further improves answer reliability. In Phase-2, we augment the prompt with nuScenes scene metadata (object annotations, ego-vehicle state, etc.) and category-specific question instructions (separate prompts for perception, prediction, and planning tasks). In experiments on a driving QA benchmark, our approach significantly outperforms the baseline Qwen2.5 models. For example, using 5 history frames and 10-shot prompting in Phase-1 yields 65.1% overall accuracy (vs. 62.61% with zero-shot); applying self-consistency raises this to 66.85%. Phase-2 achieves 67.37% overall. Notably, the system maintains 96% accuracy under severe visual corruption. These results demonstrate that carefully engineered prompts and contextual grounding can greatly enhance high-level driving QA with pretrained vision-language models.
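The self-consistency ensemble described in the abstract can be sketched as a majority vote over the final answers of independently sampled reasoning chains. This is a minimal illustration only: the chain dictionaries and their fields are hypothetical, and the paper's actual reasoning-chain integration mechanism may differ (e.g., in how chains are sampled or weighted).

```python
from collections import Counter

def self_consistency_answer(chains):
    """Majority-vote over the final answers of sampled reasoning chains.

    Each chain is assumed (hypothetically) to be a dict with an 'answer'
    field holding the option the model settled on after its rationale.
    """
    answers = [chain["answer"] for chain in chains]
    winner, _count = Counter(answers).most_common(1)[0]
    return winner

# Three chains as a stand-in for repeated sampling from the VLM (stubbed):
chains = [
    {"rationale": "Pedestrian near crosswalk, so yield.", "answer": "B"},
    {"rationale": "Lane ahead looks clear, proceed.", "answer": "A"},
    {"rationale": "Pedestrian detected front-left, yield.", "answer": "B"},
]
print(self_consistency_answer(chains))  # → B
```

The vote discards each chain's rationale and keeps only the terminal answer, which is what makes the ensemble robust to individual chains that derail mid-reasoning.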
Problem

Research questions and friction points this paper is trying to address.

Developing a robust vision-language QA system for autonomous driving scenarios
Answering high-level perception, prediction, and planning questions accurately
Enhancing driving QA reliability through metadata grounding and specialized prompts
Innovation

Methods, ideas, or system contributions that make the work stand out.

Two-phase vision-language QA system for autonomous driving
Metadata-augmented prompts with object annotations and ego-state
Self-consistency ensemble improves reliability through multiple reasoning chains
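The metadata-augmented, task-specific prompting above can be sketched as a prompt builder that serializes nuScenes-style object annotations and ego state into the instruction. The field names (`category`, `distance_m`, `bearing`, `speed_mps`, `heading_deg`) and the instruction wording are assumptions for illustration, not the paper's actual prompt templates.

```python
def build_metadata_prompt(question, task, objects, ego_state):
    """Compose a task-specific prompt grounded in scene metadata.

    `objects` and `ego_state` mimic nuScenes-style annotations; all
    field names and instruction text here are illustrative.
    """
    task_instructions = {
        "perception": "Describe what is visible around the ego vehicle.",
        "prediction": "Anticipate how the listed agents will move.",
        "planning": "Choose a safe maneuver for the ego vehicle.",
    }
    obj_lines = "\n".join(
        f"- {o['category']} at {o['distance_m']:.1f} m ({o['bearing']})"
        for o in objects
    )
    ego_line = (
        f"speed {ego_state['speed_mps']:.1f} m/s, "
        f"heading {ego_state['heading_deg']:.0f} deg"
    )
    return (
        f"{task_instructions[task]}\n"
        f"Scene objects:\n{obj_lines}\n"
        f"Ego state: {ego_line}\n"
        f"Question: {question}\n"
        "Answer step by step."
    )

prompt = build_metadata_prompt(
    question="Should the ego vehicle yield?",
    task="planning",
    objects=[{"category": "pedestrian", "distance_m": 8.2,
              "bearing": "front-left"}],
    ego_state={"speed_mps": 5.0, "heading_deg": 90},
)
print(prompt)
```

Keeping one instruction header per task category mirrors the paper's use of separate prompts for perception, prediction, and planning, while the serialized annotations ground the model in scene facts that survive visual degradation.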
Seungjun Yu
Korea Advanced Institute of Science and Technology
Junsung Park
Seoul National University
Deep Learning, Multi-modal Learning
Youngsun Lim
Korea Advanced Institute of Science and Technology
Hyunjung Shim
Associate Professor, KAIST
Computer Vision, Machine Learning