Weakly Supervised Video Scene Graph Generation via Natural Language Supervision

📅 2025-02-21
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the high annotation cost of fully supervised video scene graph generation (VidSGG), this paper proposes the first weakly supervised framework leveraging only natural language video descriptions. To tackle alignment challenges arising from ambiguous temporal grounding and variable action durations in captions, we introduce a temporal-aware caption segmentation module and a variable-duration action alignment mechanism—marking the first integration of explicit temporal structure modeling into weakly supervised VidSGG. Our approach further incorporates large language model–guided semantic understanding, dynamic frame-sentence alignment, and a weakly supervised multiple-instance learning strategy to reliably map coarse-grained captions to fine-grained scene graphs. On the Action Genome dataset, our method significantly outperforms direct adaptations of image-based weakly supervised methods and demonstrates zero-shot generalization to unseen actions.

📝 Abstract
Existing Video Scene Graph Generation (VidSGG) studies are trained in a fully supervised manner, which requires all frames in a video to be annotated, thereby incurring a high annotation cost compared to Image Scene Graph Generation (ImgSGG). Although the annotation cost of VidSGG can be alleviated by adopting a weakly supervised approach commonly used for ImgSGG (WS-ImgSGG) that uses image captions, two key factors hinder such a naive adoption: 1) Temporality within video captions, i.e., unlike image captions, video captions include temporal markers (e.g., before, while, then, after) that indicate time-related details, and 2) Variability in action duration, i.e., unlike human actions in image captions, human actions in video captions unfold over varying durations. To address these issues, we propose a Natural Language-based Video Scene Graph Generation (NL-VSGG) framework that utilizes only the readily available video captions for training a VidSGG model. NL-VSGG consists of two key modules: a Temporality-aware Caption Segmentation (TCS) module and an Action Duration Variability-aware caption-frame alignment (ADV) module. Specifically, TCS segments the video captions into multiple sentences in temporal order based on a Large Language Model (LLM), and ADV aligns each segmented sentence with the appropriate frames, accounting for the variability in action duration. Our approach leads to a significant enhancement in performance compared to simply applying the WS-ImgSGG pipeline to VidSGG on the Action Genome dataset. As a further benefit of using video captions as weak supervision, we show that a VidSGG model trained with NL-VSGG can predict a broader range of action classes not included in the training data, which makes our framework practical in real-world settings.
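To make the TCS idea concrete, here is a minimal sketch of temporality-aware caption segmentation. The paper performs this step with an LLM; the rule-based split on the temporal markers the abstract cites ("before", "while", "then", "after") is only a hypothetical stand-in, and the function name `segment_caption` is an assumption, not the paper's API.

```python
import re

# Temporal markers the abstract gives as examples of time-related cues.
TEMPORAL_MARKERS = ("before", "while", "then", "after")

def segment_caption(caption: str) -> list[str]:
    """Split a video caption into temporally ordered clauses.

    NL-VSGG's TCS module does this with an LLM; splitting on a fixed
    marker list is a simplified illustration of the same idea.
    """
    pattern = r"\b(?:" + "|".join(TEMPORAL_MARKERS) + r")\b"
    parts = re.split(pattern, caption, flags=re.IGNORECASE)
    # Drop empty fragments and trailing punctuation around each clause.
    return [p.strip(" ,.") for p in parts if p.strip(" ,.")]

clauses = segment_caption(
    "The person opens the door, then walks in while holding a bag."
)
# → ['The person opens the door', 'walks in', 'holding a bag']
```

Each resulting clause would then be matched to a contiguous run of frames, which is where the variable action durations addressed by ADV come into play.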
Problem

Research questions and friction points this paper is trying to address.

Reduces high annotation costs in video scene graph generation.
Handles temporality and action duration variability in video captions.
Utilizes video captions for weakly supervised model training.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses readily available video captions as the sole training supervision.
Segments captions into temporally ordered sentences via an LLM (TCS module).
Aligns each sentence to frames while accounting for variable action durations (ADV module).
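The alignment step above can be sketched as follows. Assuming each frame has already been scored for similarity against a segmented sentence (e.g., with a vision-language encoder), one simple way to let an action span a variable number of frames is to collect contiguous runs of frames above a similarity threshold. This greedy thresholding is an illustrative assumption, not the paper's actual ADV mechanism.

```python
def align_sentence_to_frames(sims: list[float], threshold: float = 0.5) -> list[tuple[int, int]]:
    """Return contiguous (start, end) frame spans whose similarity to a
    sentence exceeds `threshold`, so each action's duration can vary.

    `sims[i]` is an assumed precomputed sentence-frame similarity score.
    """
    spans: list[tuple[int, int]] = []
    start = None
    for i, s in enumerate(sims):
        if s >= threshold and start is None:
            start = i  # a new span of matching frames begins
        elif s < threshold and start is not None:
            spans.append((start, i - 1))  # the current span ends
            start = None
    if start is not None:
        spans.append((start, len(sims) - 1))  # span runs to the last frame
    return spans

# A sentence matching frames 1-2 strongly and frame 4 moderately:
print(align_sentence_to_frames([0.2, 0.7, 0.8, 0.3, 0.6]))
# → [(1, 2), (4, 4)]
```

Spans of different lengths fall out naturally, which mirrors the motivation behind ADV: two actions described in one caption rarely occupy the same number of frames.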