AI Summary
Existing VideoQA models struggle to capture the complex spatiotemporal dynamics and multi-agent, multi-event interactions inherent in urban traffic scenes. To address this, we introduce InterAct VideoQA, the first fine-grained, traffic-domain-specific VideoQA benchmark. It comprises over 25,000 high-quality question-answer pairs, manually annotated on 10-second clips extracted from real-world traffic surveillance videos and covering core traffic semantics such as vehicle interactions and event detection. Systematic evaluation of multiple state-of-the-art VideoQA models reveals severe performance degradation on traffic reasoning tasks, confirming their limited domain generalization. Fine-tuning these models on InterAct VideoQA yields substantial improvements in spatiotemporal reasoning. Our work demonstrates the importance of domain-specific, fine-grained annotation for advancing video understanding in complex real-world scenarios, establishing a new benchmark and methodological foundation for intelligent traffic analysis.
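For concreteness, the sketch below shows one plausible way such clip-level QA annotations could be organized and inspected. This is a minimal sketch under assumed conventions: the field names (`clip_id`, `question`, `answer`, `category`) and the JSON layout are illustrative, not the repository's confirmed schema.

```python
import json
from collections import Counter

# Illustrative record layout for one QA annotation; the actual schema in the
# InterAct VideoQA repository may differ. Each QA pair is tied to a single
# 10-second clip and labeled with a reasoning category.
example_record = {
    "clip_id": "intersection_03_000142",   # hypothetical clip identifier
    "question": "How many vehicles turn left during the clip?",
    "answer": "Three vehicles turn left.",
    "category": "vehicle_interaction",     # e.g., counting, incident detection
}

def load_annotations(path: str) -> list[dict]:
    """Load QA annotations, assumed to be stored as a JSON list of records."""
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)

def category_histogram(records: list[dict]) -> Counter:
    """Count QA pairs per reasoning category (useful for dataset statistics)."""
    return Counter(r["category"] for r in records)
```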
Abstract
Traffic monitoring is crucial for urban mobility, road safety, and intelligent transportation systems (ITS). Deep learning has advanced video-based traffic monitoring through video question answering (VideoQA) models, enabling structured insight extraction from traffic videos. However, existing VideoQA models struggle with the complexity of real-world traffic scenes, where multiple concurrent events unfold across spatiotemporal dimensions. To address these challenges, this paper introduces InterAct VideoQA, a curated dataset designed to benchmark and enhance VideoQA models for traffic monitoring tasks. The dataset comprises 8 hours of real-world traffic footage collected from diverse intersections, segmented into 10-second video clips, with over 25,000 question-answer (QA) pairs covering spatiotemporal dynamics, vehicle interactions, incident detection, and other critical traffic attributes. Evaluating state-of-the-art VideoQA models on InterAct VideoQA exposes their difficulty in reasoning over fine-grained spatiotemporal dependencies within complex traffic scenarios. Fine-tuning these models on InterAct VideoQA yields notable performance improvements, demonstrating the necessity of domain-specific datasets for VideoQA. InterAct VideoQA is publicly available as a benchmark dataset to facilitate future research on real-world deployable VideoQA models for intelligent transportation systems. GitHub Repo: https://github.com/joe-rabbit/InterAct_VideoQA
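The paper does not specify the tooling used to cut the 8 hours of footage into 10-second clips. As an illustration only, the segmentation step could be reproduced with ffmpeg's segment muxer, as in the sketch below; the file names and output pattern are placeholders.

```python
import subprocess
from pathlib import Path

def segment_into_clips(video_path: str, out_dir: str, clip_seconds: int = 10) -> None:
    """Split a long surveillance recording into fixed-length clips.

    Uses ffmpeg's segment muxer with stream copy (no re-encoding), so cuts
    land on keyframes and clip lengths are therefore approximate.
    """
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    subprocess.run(
        [
            "ffmpeg", "-i", video_path,
            "-c", "copy", "-map", "0",
            "-f", "segment",
            "-segment_time", str(clip_seconds),
            "-reset_timestamps", "1",
            f"{out_dir}/clip_%05d.mp4",  # placeholder naming scheme
        ],
        check=True,
    )

# Example (hypothetical input file):
# segment_into_clips("intersection_03.mp4", "clips/")
```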