🤖 AI Summary
This work addresses weakly supervised video scene graph generation, where only sparse temporal labels are available and bounding-box annotations are absent. In this setting, existing approaches rely on off-the-shelf object detectors, which detect every visible object and flood the relation model with noisy, non-interactive object pairs. To mitigate this, the paper introduces a learnable pair affinity that estimates how likely a subject-object pair is to interact, built around three components: Pair Affinity Learning and Scoring (PALS), Pair Affinity Modulation (PAM), and Relation-Aware Matching (RAM). PALS incorporates pair affinity into inference-time ranking and PAM integrates it into contextual reasoning, so that non-interactive pairs are suppressed and relationally meaningful ones are emphasized. RAM uses vision-language grounding to resolve class-level ambiguity in pseudo-label generation, providing cleaner supervision for pair affinity learning. On the Action Genome dataset, the method yields substantial improvements across different baselines and backbones and achieves state-of-the-art weakly supervised performance.
📝 Abstract
Weakly-supervised video scene graph generation (WS-VSGG) aims to parse video content into structured relational triplets without bounding box annotations and with only sparse temporal labeling, significantly reducing annotation costs. Without ground-truth bounding boxes, these methods rely on off-the-shelf detectors to generate object proposals, yet largely overlook a fundamental discrepancy from fully-supervised pipelines. Fully-supervised detectors implicitly filter out noninteractive objects, while off-the-shelf detectors indiscriminately detect all visible objects, overwhelming relation models with noisy pairs. We address this by introducing a learnable pair affinity that estimates the likelihood of interaction between subject-object pairs. Through Pair Affinity Learning and Scoring (PALS), pair affinity is incorporated into inference-time ranking and further integrated into contextual reasoning through Pair Affinity Modulation (PAM), enabling the model to suppress noninteractive pairs and focus on relationally meaningful ones. To provide cleaner supervision for pair affinity learning, we further propose Relation-Aware Matching (RAM), which leverages vision-language grounding to resolve class-level ambiguity in pseudo-label generation. Extensive experiments on Action Genome demonstrate that our approach consistently yields substantial improvements across different baselines and backbones, achieving state-of-the-art WS-VSGG performance.
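To make the idea of affinity-aware inference-time ranking concrete, the following is a minimal sketch, not the paper's implementation. It assumes the final triplet score is the product of a relation score and a pair affinity; the abstract does not state the actual scoring rule, and the `Candidate` class, the `rank_triplets` function, and all numeric values are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    subject: str
    predicate: str
    obj: str
    relation_score: float  # predicate confidence from a relation model (hypothetical value)
    pair_affinity: float   # estimated likelihood that the pair interacts (hypothetical value)


def rank_triplets(candidates, use_affinity=True):
    """Sort candidates by score, highest first.

    Assumption: affinity is applied as a multiplicative weight on the
    relation score. The paper may combine the two differently.
    """
    def score(c):
        weight = c.pair_affinity if use_affinity else 1.0
        return c.relation_score * weight

    return sorted(candidates, key=score, reverse=True)


candidates = [
    Candidate("person", "holding", "cup", 0.70, 0.90),
    # A detected but non-interacting object: high relation score, low affinity.
    Candidate("person", "looking_at", "shelf", 0.80, 0.10),
    Candidate("person", "sitting_on", "sofa", 0.60, 0.95),
]

ranked = rank_triplets(candidates)
baseline = rank_triplets(candidates, use_affinity=False)

for c in ranked:
    print(c.subject, c.predicate, c.obj)
```

In this toy example the `shelf` pair ranks first when affinity is ignored and last once affinity is applied, which is the kind of suppression of noninteractive pairs the abstract describes.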