Salient Temporal Encoding for Dynamic Scene Graph Generation

📅 2025-03-15

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

To address the redundancy and inefficiency of temporal modeling in dynamic scene graph generation—where existing methods rely on dense, abstract fully connected temporal relationships without discriminative modeling of genuine temporal dynamics—this paper proposes a saliency-aware sparse temporal edge modeling mechanism. We introduce, for the first time, explicit and sparse temporal edges, connecting only object pairs exhibiting significant temporal correlation to avoid spurious modeling. Our approach integrates temporal saliency estimation, dynamic graph neural networks, and multi-frame feature alignment to enable efficient and discriminative temporal relation learning. On the Scene Graph Detection task, our method achieves a 4.4% improvement in mean Recall@50; for action recognition, it attains a 0.6% higher mAP than the state-of-the-art. Moreover, the learned representations demonstrate strong cross-task transferability.

Technology Category

Application Category

📝 Abstract

Representing a dynamic scene using a structured spatial-temporal scene graph is a novel and particularly challenging task. To tackle this task, it is crucial to learn the temporal interactions between objects in addition to their spatial relations. Due to the lack of explicitly annotated temporal relations in current benchmark datasets, most of the existing spatial-temporal scene graph generation methods build dense and abstract temporal connections among all objects across frames. However, not all temporal connections are encoding meaningful temporal dynamics. We propose a novel spatial-temporal scene graph generation method that selectively builds temporal connections only between temporal-relevant objects pairs and represents the temporal relations as explicit edges in the scene graph. The resulting sparse and explicit temporal representation allows us to improve upon strong scene graph generation baselines by up to $4.4%$ in Scene Graph Detection. In addition, we show that our approach can be leveraged to improve downstream vision tasks. Particularly, applying our approach to action recognition, shows 0.6% gain in mAP in comparison to the state-of-the-art

Problem

Research questions and friction points this paper is trying to address.

Dynamic scene graph generation with temporal interactions.

Selective temporal connections for meaningful dynamics.

Improving scene graph detection and action recognition.

Innovation

Methods, ideas, or system contributions that make the work stand out.

Selective temporal connections between relevant objects

Explicit temporal edges in scene graph

Improves scene graph detection by 4.4%

🔎 Similar Papers

No similar papers found.

Authors to Follow