InterAct-Video: Reasoning-Rich Video QA for Urban Traffic

πŸ“… 2025-07-19
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
Existing VideoQA models struggle to model the complex spatiotemporal dynamics and multi-agent, multi-event interactions inherent in urban traffic scenes. To address this, we introduce InterAct VideoQAβ€”the first fine-grained, traffic-domain-specific VideoQA benchmark. It comprises over 25,000 high-quality question-answer pairs, manually annotated on 10-second clips extracted from real-world traffic surveillance videos, covering core traffic semantics such as vehicle interactions and event detection. Systematic evaluation across multiple state-of-the-art VideoQA models reveals severe performance degradation on traffic reasoning tasks, confirming their domain generalization limitations. Fine-tuning these models on InterAct VideoQA yields substantial improvements in spatiotemporal reasoning capability. Our work demonstrates the critical importance of domain-specific, fine-grained annotation for advancing video understanding in complex real-world scenarios, establishing a new benchmark and methodological foundation for intelligent traffic analysis.

πŸ“ Abstract
Traffic monitoring is crucial for urban mobility, road safety, and intelligent transportation systems (ITS). Deep learning has advanced video-based traffic monitoring through video question answering (VideoQA) models, enabling structured insight extraction from traffic videos. However, existing VideoQA models struggle with the complexity of real-world traffic scenes, where multiple concurrent events unfold across spatiotemporal dimensions. To address these challenges, this paper introduces InterAct VideoQA, a curated dataset designed to benchmark and enhance VideoQA models for traffic monitoring tasks. The InterAct VideoQA dataset comprises 8 hours of real-world traffic footage collected from diverse intersections, segmented into 10-second video clips, with over 25,000 question-answer (QA) pairs covering spatiotemporal dynamics, vehicle interactions, incident detection, and other critical traffic attributes. State-of-the-art VideoQA models are evaluated on InterAct VideoQA, exposing challenges in reasoning over fine-grained spatiotemporal dependencies within complex traffic scenarios. Additionally, fine-tuning these models on InterAct VideoQA yields notable performance improvements, demonstrating the necessity of domain-specific datasets for VideoQA. InterAct VideoQA is publicly available as a benchmark dataset to facilitate future research in real-world deployable VideoQA models for intelligent transportation systems. GitHub Repo: https://github.com/joe-rabbit/InterAct_VideoQA
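The dataset layout described in the abstract (8 hours of footage split into 10-second clips, each annotated with categorized QA pairs) can be sketched as follows. This is a minimal illustration, not the dataset's actual schema: the field names, the `QAPair`/`Clip` types, and the `segment` helper are all hypothetical.

```python
# Hypothetical sketch of the structure the abstract describes:
# fixed-length 10-second clips, each carrying QA annotations in
# categories such as vehicle interactions and incident detection.
# All names here are illustrative, not InterAct VideoQA's real schema.
from dataclasses import dataclass, field

@dataclass
class QAPair:
    question: str
    answer: str
    category: str  # e.g. "vehicle_interaction", "incident_detection"

@dataclass
class Clip:
    clip_id: str
    start_s: float            # offset into the source footage, seconds
    duration_s: float = 10.0
    qa_pairs: list = field(default_factory=list)

def segment(total_seconds: float, clip_len: float = 10.0) -> list:
    """Split footage into fixed-length clips (a short tail is dropped)."""
    return [Clip(clip_id=f"clip_{i:05d}", start_s=i * clip_len)
            for i in range(int(total_seconds // clip_len))]

# 8 hours of footage yields 2880 ten-second clips, consistent with
# ~25,000 QA pairs averaging roughly 8-9 questions per clip.
clips = segment(8 * 3600)
print(len(clips))  # 2880
```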
Problem

Research questions and friction points this paper is trying to address.

Enhance VideoQA models for complex traffic scenes
Address spatiotemporal reasoning in traffic monitoring
Improve incident detection and vehicle interaction analysis
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces InterAct VideoQA dataset for traffic monitoring
Enhances VideoQA models with spatiotemporal reasoning
Fine-tunes models for complex traffic scenarios
πŸ”Ž Similar Papers
J
Joseph Raj Vishal
Arizona State University
R
Rutuja Patil
Arizona State University
M
Manas Srinivas Gowda
Arizona State University
Katha Naik
Katha Naik
Student Researcher, Arizona State University
Y
Yezhou Yang
Arizona State University
Bharatesh Chakravarthi
Bharatesh Chakravarthi
School of Computing and AI, Arizona State University
Event-based VisionITSHuman Pose Estimation