UNO: Unifying One-stage Video Scene Graph Generation via Object-Centric Visual Representation Learning

๐Ÿ“… 2025-09-07
๐Ÿ“ˆ Citations: 0
โœจ Influential: 0
๐Ÿ“„ PDF
๐Ÿค– AI Summary
Video Scene Graph Generation (VidSGG) suffers from a fragmentation between bounding-box-level and pixel-level tasks, requiring multi-stage training and task-specific architectures. This paper proposes UNO, a unified single-stage framework that, for the first time, jointly models coarse-grained object detection and fine-grained panoptic relation segmentation end-to-end within a single model. Its key innovations are: (1) an object-centric expanded slot attention mechanism enabling cross-frame object consistency modeling without explicit tracking; (2) a dynamic triplet prediction module coupled with objectโ€“relation slot feature decoupling to support multi-granularity joint optimization; and (3) temporal consistency learning to enhance cross-frame semantic stability. UNO achieves state-of-the-art performance on both box-level (VideoGraphs) and pixel-level (Panoptic VidSGG) benchmarks, reduces parameter count by 37%, accelerates inference by 2.1ร—, and demonstrates significantly improved generalization.

Technology Category

Application Category

๐Ÿ“ Abstract
Video Scene Graph Generation (VidSGG) aims to represent dynamic visual content by detecting objects and modeling their temporal interactions as structured graphs. Prior studies typically target either coarse-grained box-level or fine-grained panoptic pixel-level VidSGG, often requiring task-specific architectures and multi-stage training pipelines. In this paper, we present UNO (UNified Object-centric VidSGG), a single-stage, unified framework that jointly addresses both tasks within an end-to-end architecture. UNO is designed to minimize task-specific modifications and maximize parameter sharing, enabling generalization across different levels of visual granularity. The core of UNO is an extended slot attention mechanism that decomposes visual features into object and relation slots. To ensure robust temporal modeling, we introduce object temporal consistency learning, which enforces consistent object representations across frames without relying on explicit tracking modules. Additionally, a dynamic triplet prediction module links relation slots to corresponding object pairs, capturing evolving interactions over time. We evaluate UNO on standard box-level and pixel-level VidSGG benchmarks. Results demonstrate that UNO not only achieves competitive performance across both tasks but also offers improved efficiency through a unified, object-centric design.
Problem

Research questions and friction points this paper is trying to address.

Unifying coarse and fine-grained video scene graph generation
Minimizing task-specific modifications and maximizing parameter sharing
Enabling generalization across different visual granularity levels
Innovation

Methods, ideas, or system contributions that make the work stand out.

Extended slot attention mechanism for object decomposition
Object temporal consistency learning without tracking modules
Dynamic triplet prediction module for evolving interactions
๐Ÿ”Ž Similar Papers
2024-02-20International Conference on Machine LearningCitations: 30