UDVideoQA: A Traffic Video Question Answering Dataset for Multi-Object Spatio-Temporal Reasoning in Urban Dynamics

📅 2026-02-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the limited capability of video-language models in understanding multi-agent dynamics and performing spatiotemporal reasoning within complex urban traffic scenarios. To this end, we introduce UDVideoQA, the first benchmark specifically designed for multi-object spatiotemporal reasoning in dynamic urban traffic, constructed from real-world intersection videos. Privacy is preserved through event-driven dynamic blurring, and a unified dense annotation pipeline yields a hierarchical question set encompassing advanced cognitive tasks such as counterfactual reasoning. Experimental results demonstrate that a fine-tuned Qwen2.5-VL 7B model achieves performance comparable to proprietary systems, while also revealing critical limitations in current approaches—including a pervasive perception-reasoning gap and insufficient linguistic diversity in automatically generated questions.

📝 Abstract
Understanding the complex, multi-agent dynamics of urban traffic remains a fundamental challenge for video-language models. This paper introduces Urban Dynamics VideoQA, a benchmark dataset that captures the unscripted real-world behavior of dynamic urban scenes. UDVideoQA is curated from 16 hours of traffic footage recorded at multiple city intersections under diverse traffic, weather, and lighting conditions. It employs an event-driven dynamic blur technique to ensure privacy preservation without compromising scene fidelity. Using a unified annotation pipeline, the dataset contains 28K question-answer pairs generated across 8 hours of densely annotated video, averaging one question per second. Its taxonomy follows hierarchical reasoning levels, spanning basic understanding and attribution to event reasoning, reverse reasoning, and counterfactual inference, enabling systematic evaluation of both visual grounding and causal reasoning. Comprehensive experiments benchmark 10 SOTA VideoLMs on UDVideoQA and 8 models on a complementary video question generation benchmark. Results reveal a persistent perception-reasoning gap: models that excel at abstract inference often fail at fundamental visual grounding. While models like Gemini Pro achieve the highest zero-shot accuracy, fine-tuning the smaller Qwen2.5-VL 7B model on UDVideoQA bridges this gap, achieving performance comparable to proprietary systems. On VideoQGen, Gemini 2.5 Pro and Qwen3 Max generate the most relevant and complex questions, though all models exhibit limited linguistic diversity, underscoring the need for human-centric evaluation. The UDVideoQA suite, including the dataset, annotation tools, and benchmarks for both VideoQA and VideoQGen, provides a foundation for advancing robust, privacy-aware, and real-world multimodal reasoning. UDVideoQA is available at https://ud-videoqa.github.io/UD-VideoQA/UD-VideoQA/.
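The abstract names an event-driven dynamic blur technique for privacy but gives no implementation details. A minimal sketch of the general idea, assuming per-frame detector output (the detector, box format, and box-blur kernel here are hypothetical illustrations, not the paper's actual pipeline), could look like:

```python
import numpy as np

def blur_region(frame, box, k=7):
    """Box-blur one rectangular region of a grayscale frame in place.

    `box` is a hypothetical (x0, y0, x1, y1) detector output, e.g. a
    face or license-plate bounding box.
    """
    x0, y0, x1, y1 = box
    region = frame[y0:y1, x0:x1].astype(float)
    pad = k // 2
    # edge-pad so the blur is defined up to the region border
    padded = np.pad(region, ((pad, pad), (pad, pad)), mode="edge")
    out = np.zeros_like(region)
    # naive k x k box filter: average every shifted copy of the region
    for dy in range(k):
        for dx in range(k):
            out += padded[dy:dy + region.shape[0], dx:dx + region.shape[1]]
    frame[y0:y1, x0:x1] = (out / (k * k)).astype(frame.dtype)
    return frame

def event_driven_blur(frames, detections):
    """Blur only the frames where a detection event fired.

    `detections` maps frame index -> list of boxes; frames with no
    entry are left untouched, preserving overall scene fidelity.
    """
    for t, boxes in detections.items():
        for box in boxes:
            frames[t] = blur_region(frames[t], box)
    return frames
```

The event-driven part is simply that blurring is gated on detections, so static scene content is never degraded; a production pipeline would also track boxes across frames to avoid flicker.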
Problem

Research questions and friction points this paper is trying to address.

urban dynamics
video question answering
spatio-temporal reasoning
multi-object interaction
traffic video understanding
Innovation

Methods, ideas, or system contributions that make the work stand out.

dynamic blur for privacy
multi-object spatio-temporal reasoning
hierarchical reasoning taxonomy
video question answering
video question generation
Joseph Raj Vishal
Arizona State University
Nagasiri Poluri
Arizona State University
Katha Naik
Student Researcher, Arizona State University
Rutuja Patil
Arizona State University
Kashyap Hegde Kota
Arizona State University
Krishna Vinod
Arizona State University
Prithvi Jai Ramesh
Arizona State University
Mohammad Farhadi
PhD Student, Arizona State University
Computer Vision, Machine Learning, Wireless Sensor Networks, Network Security
Yezhou Yang
Arizona State University
Bharatesh Chakravarthi
School of Computing and AI, Arizona State University
Event-based Vision, ITS, Human Pose Estimation