AI Summary
Existing VideoQA models struggle to capture the complex spatiotemporal dynamics and multi-agent, multi-event interactions inherent in urban traffic scenes. To address this, we introduce InterAct VideoQA, the first fine-grained, traffic-domain-specific VideoQA benchmark. It comprises over 25,000 high-quality question-answer pairs, manually annotated on 10-second clips extracted from real-world traffic surveillance videos and covering core traffic semantics such as vehicle interactions and event detection. Systematic evaluation of multiple state-of-the-art VideoQA models reveals severe performance degradation on traffic reasoning tasks, confirming their limited domain generalization. Fine-tuning these models on InterAct VideoQA yields substantial improvements in spatiotemporal reasoning. Our work demonstrates the importance of domain-specific, fine-grained annotation for advancing video understanding in complex real-world scenarios, establishing a new benchmark and methodological foundation for intelligent traffic analysis.
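For concreteness, the sketch below shows one plausible way such clip-level QA annotations could be organized and inspected. This is a minimal sketch under assumed conventions: the field names (`clip_id`, `question`, `answer`, `category`) and the JSON layout are illustrative, not the repository's confirmed schema.

```python
import json
from collections import Counter

# Illustrative record layout for one QA annotation; the actual schema in the
# InterAct VideoQA repository may differ. Each QA pair is tied to a single
# 10-second clip and labeled with a reasoning category.
example_record = {
    "clip_id": "intersection_03_000142",   # hypothetical clip identifier
    "question": "How many vehicles turn left during the clip?",
    "answer": "Three vehicles turn left.",
    "category": "vehicle_interaction",     # e.g., counting, incident detection
}

def load_annotations(path: str) -> list[dict]:
    """Load QA annotations, assumed to be stored as a JSON list of records."""
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)

def category_histogram(records: list[dict]) -> Counter:
    """Count QA pairs per reasoning category (useful for dataset statistics)."""
    return Counter(r["category"] for r in records)
```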
Abstract
Traffic monitoring is crucial for urban mobility, road safety, and intelligent transportation systems (ITS). Deep learning has advanced video-based traffic monitoring through video question answering (VideoQA) models, enabling structured insight extraction from traffic videos. However, existing VideoQA models struggle with the complexity of real-world traffic scenes, where multiple concurrent events unfold across spatiotemporal dimensions. To address these challenges, this paper introduces InterAct VideoQA, a curated dataset designed to benchmark and enhance VideoQA models for traffic monitoring tasks. The dataset comprises 8 hours of real-world traffic footage collected from diverse intersections, segmented into 10-second video clips, with over 25,000 question-answer (QA) pairs covering spatiotemporal dynamics, vehicle interactions, incident detection, and other critical traffic attributes. Evaluating state-of-the-art VideoQA models on InterAct VideoQA exposes their difficulty in reasoning over fine-grained spatiotemporal dependencies within complex traffic scenarios. Fine-tuning these models on InterAct VideoQA yields notable performance improvements, demonstrating the necessity of domain-specific datasets for VideoQA. InterAct VideoQA is publicly available as a benchmark dataset to facilitate future research on real-world deployable VideoQA models for intelligent transportation systems. GitHub Repo: https://github.com/joe-rabbit/InterAct_VideoQA
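The paper does not specify the tooling used to cut the 8 hours of footage into 10-second clips. As an illustration only, the segmentation step could be reproduced with ffmpeg's segment muxer, as in the sketch below; the file names and output pattern are placeholders.

```python
import subprocess
from pathlib import Path

def segment_into_clips(video_path: str, out_dir: str, clip_seconds: int = 10) -> None:
    """Split a long surveillance recording into fixed-length clips.

    Uses ffmpeg's segment muxer with stream copy (no re-encoding), so cuts
    land on keyframes and clip lengths are therefore approximate.
    """
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    subprocess.run(
        [
            "ffmpeg", "-i", video_path,
            "-c", "copy", "-map", "0",
            "-f", "segment",
            "-segment_time", str(clip_seconds),
            "-reset_timestamps", "1",
            f"{out_dir}/clip_%05d.mp4",  # placeholder naming scheme
        ],
        check=True,
    )

# Example (hypothetical input file):
# segment_into_clips("intersection_03.mp4", "clips/")
```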