๐ค AI Summary
Video Scene Graph Generation (VidSGG) suffers from a fragmentation between bounding-box-level and pixel-level tasks, requiring multi-stage training and task-specific architectures. This paper proposes UNO, a unified single-stage framework that, for the first time, jointly models coarse-grained object detection and fine-grained panoptic relation segmentation end-to-end within a single model. Its key innovations are: (1) an object-centric expanded slot attention mechanism enabling cross-frame object consistency modeling without explicit tracking; (2) a dynamic triplet prediction module coupled with objectโrelation slot feature decoupling to support multi-granularity joint optimization; and (3) temporal consistency learning to enhance cross-frame semantic stability. UNO achieves state-of-the-art performance on both box-level (VideoGraphs) and pixel-level (Panoptic VidSGG) benchmarks, reduces parameter count by 37%, accelerates inference by 2.1ร, and demonstrates significantly improved generalization.
๐ Abstract
Video Scene Graph Generation (VidSGG) aims to represent dynamic visual content by detecting objects and modeling their temporal interactions as structured graphs. Prior studies typically target either coarse-grained box-level or fine-grained panoptic pixel-level VidSGG, often requiring task-specific architectures and multi-stage training pipelines. In this paper, we present UNO (UNified Object-centric VidSGG), a single-stage, unified framework that jointly addresses both tasks within an end-to-end architecture. UNO is designed to minimize task-specific modifications and maximize parameter sharing, enabling generalization across different levels of visual granularity. The core of UNO is an extended slot attention mechanism that decomposes visual features into object and relation slots. To ensure robust temporal modeling, we introduce object temporal consistency learning, which enforces consistent object representations across frames without relying on explicit tracking modules. Additionally, a dynamic triplet prediction module links relation slots to corresponding object pairs, capturing evolving interactions over time. We evaluate UNO on standard box-level and pixel-level VidSGG benchmarks. Results demonstrate that UNO not only achieves competitive performance across both tasks but also offers improved efficiency through a unified, object-centric design.