TRKT: Weakly Supervised Dynamic Scene Graph Generation with Temporal-enhanced Relation-aware Knowledge Transferring

📅 2025-08-06
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the inaccurate object localization and low relational confidence produced by external static detectors in weakly supervised dynamic scene graph generation (WS-DSGG), this paper proposes a temporal-enhanced, relation-aware knowledge transfer framework. Methodologically, it introduces a relation-aware knowledge mining mechanism and an optical-flow-guided cross-frame attention enhancement module, combined with class-specific attention maps and a dual-stream fusion architecture, to enable adaptive proposal refinement and pseudo-label optimization. Crucially, relational decoding priors are explicitly incorporated into the detector refinement process, steering detection toward relation-aware learning. Evaluated on the Action Genome dataset, the framework achieves state-of-the-art performance, significantly improving both object detection and relation prediction accuracy. This work establishes a knowledge-transfer paradigm for weakly supervised video understanding.

📝 Abstract
Dynamic Scene Graph Generation (DSGG) aims to create a scene graph for each video frame by detecting objects and predicting their relationships. Weakly Supervised DSGG (WS-DSGG) reduces the annotation workload by using an unlocalized scene graph from a single frame per video for training. Existing WS-DSGG methods depend on an off-the-shelf external object detector to generate pseudo labels for subsequent DSGG training. However, detectors trained on static, object-centric images struggle in the dynamic, relation-aware scenarios required for DSGG, leading to inaccurate localization and low-confidence proposals. To address the challenges posed by external object detectors in WS-DSGG, we propose a Temporal-enhanced Relation-aware Knowledge Transferring (TRKT) method, which leverages knowledge to enhance detection in relation-aware dynamic scenarios. TRKT is built on two key components: (1) Relation-aware knowledge mining: we first employ object and relation class decoders that generate category-specific attention maps to highlight both object regions and interactive areas. Then we propose an Inter-frame Attention Augmentation strategy that exploits optical flow between neighboring frames to enhance the attention maps, making them motion-aware and robust to motion blur. This step yields relation- and motion-aware knowledge mining for WS-DSGG. (2) Dual-stream fusion: we introduce a Dual-stream Fusion Module that integrates the category-specific attention maps into external detections to refine object localization and boost confidence scores for object proposals. Extensive experiments demonstrate that TRKT achieves state-of-the-art performance on the Action Genome dataset. Our code is available at https://github.com/XZPKU/TRKT.git.
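The Inter-frame Attention Augmentation step described above can be pictured as warping a neighboring frame's attention map into the current frame along the optical flow, then fusing the two maps. The sketch below is only an illustration of that idea, not the paper's implementation: the nearest-neighbor sampling, the max-based fusion rule, and the `alpha` weight are all assumptions made for readability.

```python
import numpy as np

def warp_attention(attn_prev, flow):
    """Backward-warp a neighboring frame's attention map into the
    current frame using optical flow (nearest-neighbor sampling).
    attn_prev: (H, W) attention map from the neighboring frame.
    flow: (H, W, 2) per-pixel displacement (dx, dy) from the current
    frame into the neighboring frame.
    """
    H, W = attn_prev.shape
    ys, xs = np.mgrid[0:H, 0:W]
    # Sample locations in the neighboring frame, clipped to image bounds.
    src_x = np.clip(np.round(xs + flow[..., 0]).astype(int), 0, W - 1)
    src_y = np.clip(np.round(ys + flow[..., 1]).astype(int), 0, H - 1)
    return attn_prev[src_y, src_x]

def augment_attention(attn_cur, attn_prev, flow, alpha=0.5):
    """Fuse the current frame's attention with the flow-warped neighbor
    attention so blurred regions still receive motion-aware evidence.
    (Max fusion and alpha are illustrative choices.)"""
    warped = warp_attention(attn_prev, flow)
    return np.maximum(attn_cur, alpha * warped)
```

With zero flow the warp is an identity, so the neighbor's attention carries over unchanged; a nonzero flow shifts the evidence to where the object has moved in the current frame.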
Problem

Research questions and friction points this paper is trying to address.

Enhance object detection in dynamic, relation-aware scenarios for WS-DSGG
Improve object localization and confidence scores in weakly supervised DSGG
Address limitations of external object detectors in dynamic video frames
Innovation

Methods, ideas, or system contributions that make the work stand out.

Relation-aware knowledge mining with attention maps
Inter-frame Attention Augmentation using optical flow
Dual-stream Fusion Module for refined object localization
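The dual-stream refinement named above amounts to re-scoring each external proposal with the class-specific attention evidence inside its box. The following is a minimal sketch of that re-scoring, assuming a simple mean-attention score and a linear blend with weight `beta`; both the function names and the fusion rule are hypothetical, not TRKT's actual module.

```python
import numpy as np

def attention_score(attn, box):
    """Mean class-specific attention inside a proposal box (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = box
    region = attn[y1:y2, x1:x2]
    return float(region.mean()) if region.size else 0.0

def fuse_confidence(det_score, attn, box, beta=0.5):
    """Blend the external detector's confidence with the attention
    evidence inside the box (illustrative linear fusion)."""
    return (1.0 - beta) * det_score + beta * attention_score(attn, box)
```

A proposal whose box covers a strongly attended (e.g. interactive) region gets its low detector confidence boosted, which is the effect the Dual-stream Fusion Module is described as achieving.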