🤖 AI Summary
To address the redundancy and inefficiency of temporal modeling in dynamic scene graph generation—where existing methods rely on dense, abstract fully connected temporal relationships without discriminative modeling of genuine temporal dynamics—this paper proposes a saliency-aware sparse temporal edge modeling mechanism. We introduce, for the first time, explicit and sparse temporal edges, connecting only object pairs exhibiting significant temporal correlation to avoid spurious modeling. Our approach integrates temporal saliency estimation, dynamic graph neural networks, and multi-frame feature alignment to enable efficient and discriminative temporal relation learning. On the Scene Graph Detection task, our method achieves a 4.4% improvement in mean Recall@50; for action recognition, it attains a 0.6% higher mAP than the state-of-the-art. Moreover, the learned representations demonstrate strong cross-task transferability.
📝 Abstract
Representing a dynamic scene using a structured spatial-temporal scene graph is a novel and particularly challenging task. To tackle this task, it is crucial to learn the temporal interactions between objects in addition to their spatial relations. Due to the lack of explicitly annotated temporal relations in current benchmark datasets, most of the existing spatial-temporal scene graph generation methods build dense and abstract temporal connections among all objects across frames. However, not all temporal connections are encoding meaningful temporal dynamics. We propose a novel spatial-temporal scene graph generation method that selectively builds temporal connections only between temporal-relevant objects pairs and represents the temporal relations as explicit edges in the scene graph. The resulting sparse and explicit temporal representation allows us to improve upon strong scene graph generation baselines by up to $4.4%$ in Scene Graph Detection. In addition, we show that our approach can be leveraged to improve downstream vision tasks. Particularly, applying our approach to action recognition, shows 0.6% gain in mAP in comparison to the state-of-the-art