Temporally Consistent Dynamic Scene Graphs: An End-to-End Approach for Action Tracklet Generation

📅 2024-12-03
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge of dynamic scene modeling in video understanding by proposing the Temporally Consistent Dynamic Scene Graph (TCDSG) framework, which jointly models and robustly tracks subject–object relationships across frames. Methodologically, the work (1) designs an adaptive attention decoder with closed-loop feedback to enhance temporal consistency in bipartite graph matching; (2) introduces persistent object ID annotations on the MEVA dataset for the first time; and (3) adopts an end-to-end architecture that models relations jointly across multiple frames. Evaluated on the Action Genome, OpenPVSG, and MEVA benchmarks, TCDSG achieves over 60% improvement in temporal recall@k, significantly boosting robustness in long-sequence interaction tracking. The framework advances fine-grained action understanding for real-world applications such as surveillance and autonomous driving.
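The core idea of the matching step can be illustrated with a small sketch. The paper does not publish this exact code; the function name, the dictionary-based assignment format, and the `bias` parameter are illustrative assumptions, standing in for the adaptive decoder queries and feedback loop described above. The sketch performs standard Hungarian (bipartite) matching between decoder queries and ground-truth relation triplets, but discounts the cost of re-assigning a query to the track it matched in the previous frame, which is one plausible way to encourage temporal consistency:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def temporally_biased_matching(cost, prev_assignment, bias=0.5):
    """Match N decoder queries to M ground-truth relation triplets.

    cost: (N, M) array of matching costs for the current frame.
    prev_assignment: dict mapping query index -> ground-truth track index
        from the previous frame (illustrative format, not the paper's).
    bias: cost discount that rewards keeping a query on its previous track.
    """
    cost = cost.copy()  # do not mutate the caller's cost matrix
    for q, gt_track in prev_assignment.items():
        if q < cost.shape[0] and gt_track < cost.shape[1]:
            # Encourage temporal consistency: make it cheaper for a query
            # to stay matched to the same ground-truth track as last frame.
            cost[q, gt_track] -= bias
    rows, cols = linear_sum_assignment(cost)
    return dict(zip(rows.tolist(), cols.tolist()))
```

With a bias of zero this reduces to the frame-independent matching used by DETR-style detectors; the bias term is what makes assignments sticky across frames.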

📝 Abstract
Understanding video content is pivotal for advancing real-world applications like activity recognition, autonomous systems, and human-computer interaction. While scene graphs are adept at capturing spatial relationships between objects in individual frames, extending these representations to capture dynamic interactions across video sequences remains a significant challenge. To address this, we present TCDSG, Temporally Consistent Dynamic Scene Graphs, an innovative end-to-end framework that detects, tracks, and links subject-object relationships across time, generating action tracklets: temporally consistent sequences of entities and their interactions. Our approach leverages a novel bipartite matching mechanism, enhanced by adaptive decoder queries and feedback loops, ensuring temporal coherence and robust tracking over extended sequences. This method not only establishes a new benchmark by achieving over 60% improvement in temporal recall@k on the Action Genome, OpenPVSG, and MEVA datasets but also pioneers the augmentation of MEVA with persistent object ID annotations for comprehensive tracklet generation. By seamlessly integrating spatial and temporal dynamics, our work sets a new standard in multi-frame video analysis, opening new avenues for high-impact applications in surveillance, autonomous navigation, and beyond.
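The headline metric, temporal recall@k, is not defined in this summary, but a plausible reading is recall@k evaluated at the tracklet level rather than per frame: a ground-truth tracklet counts as recovered only if a top-k prediction carries the same triplet and overlaps it sufficiently in time. The function below is a hedged sketch under that assumption; the data format, the 50% overlap threshold, and the function name are all illustrative, not the paper's definition:

```python
def temporal_recall_at_k(pred, gt, k, min_overlap=0.5):
    """Tracklet-level recall@k (illustrative definition).

    pred: list of (score, triplet, frame_set) candidate tracklets.
    gt: list of (triplet, frame_set) ground-truth tracklets.
    A ground-truth tracklet is recalled if some top-k prediction has the
    same (subject, predicate, object) triplet and covers at least
    `min_overlap` of its frames.
    """
    if not gt:
        return 0.0
    topk = sorted(pred, key=lambda p: -p[0])[:k]  # keep k highest-scoring
    hits = 0
    for g_triplet, g_frames in gt:
        for _, p_triplet, p_frames in topk:
            if (p_triplet == g_triplet
                    and len(p_frames & g_frames) >= min_overlap * len(g_frames)):
                hits += 1
                break  # each ground-truth tracklet counts at most once
    return hits / len(gt)
```

The key difference from per-frame recall@k is the frame-overlap requirement: a prediction that flickers in and out of the correct relationship no longer counts, which is exactly the failure mode the temporal-consistency machinery targets.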
Problem

Research questions and friction points this paper is trying to address.

Extending scene graphs to capture dynamic video interactions
Ensuring temporal coherence in object tracking
Generating action tracklets for multi-frame video analysis
Innovation

Methods, ideas, or system contributions that make the work stand out.

End-to-end framework for dynamic scene graphs
Novel bipartite matching with adaptive queries
Persistent object ID annotations for tracklets
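The bullets above can be tied together with one final sketch. Once each frame's queries are matched, tracklet generation falls out naturally: a query that stays matched across consecutive frames accumulates one action tracklet. The function and data format below are illustrative assumptions, not the paper's implementation:

```python
def build_tracklets(per_frame_matches):
    """Group per-frame matches into action tracklets.

    per_frame_matches: list with one dict per frame, each mapping a
    query index -> (subject, predicate, object) triplet matched in that
    frame (illustrative format). Because the matching is temporally
    biased, a query index acts as a persistent track ID.
    """
    tracklets = {}  # query index -> list of (frame, triplet)
    for t, matches in enumerate(per_frame_matches):
        for q, triplet in matches.items():
            tracklets.setdefault(q, []).append((t, triplet))
    return tracklets
```

This is where the persistent object ID annotations matter for evaluation: without stable IDs across frames, there is no ground truth against which a multi-frame tracklet like this can be scored.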