UNO: Unifying One-stage Video Scene Graph Generation via Object-Centric Visual Representation Learning

📅 2025-09-07

📈 Citations: 0

✨ Influential: 0

career value

181K/year

🤖 AI Summary

Video Scene Graph Generation (VidSGG) suffers from a fragmentation between bounding-box-level and pixel-level tasks, requiring multi-stage training and task-specific architectures. This paper proposes UNO, a unified single-stage framework that, for the first time, jointly models coarse-grained object detection and fine-grained panoptic relation segmentation end-to-end within a single model. Its key innovations are: (1) an object-centric expanded slot attention mechanism enabling cross-frame object consistency modeling without explicit tracking; (2) a dynamic triplet prediction module coupled with object–relation slot feature decoupling to support multi-granularity joint optimization; and (3) temporal consistency learning to enhance cross-frame semantic stability. UNO achieves state-of-the-art performance on both box-level (VideoGraphs) and pixel-level (Panoptic VidSGG) benchmarks, reduces parameter count by 37%, accelerates inference by 2.1×, and demonstrates significantly improved generalization.

Technology Category

Application Category

📝 Abstract

Video Scene Graph Generation (VidSGG) aims to represent dynamic visual content by detecting objects and modeling their temporal interactions as structured graphs. Prior studies typically target either coarse-grained box-level or fine-grained panoptic pixel-level VidSGG, often requiring task-specific architectures and multi-stage training pipelines. In this paper, we present UNO (UNified Object-centric VidSGG), a single-stage, unified framework that jointly addresses both tasks within an end-to-end architecture. UNO is designed to minimize task-specific modifications and maximize parameter sharing, enabling generalization across different levels of visual granularity. The core of UNO is an extended slot attention mechanism that decomposes visual features into object and relation slots. To ensure robust temporal modeling, we introduce object temporal consistency learning, which enforces consistent object representations across frames without relying on explicit tracking modules. Additionally, a dynamic triplet prediction module links relation slots to corresponding object pairs, capturing evolving interactions over time. We evaluate UNO on standard box-level and pixel-level VidSGG benchmarks. Results demonstrate that UNO not only achieves competitive performance across both tasks but also offers improved efficiency through a unified, object-centric design.

Problem

Research questions and friction points this paper is trying to address.

Unifying coarse and fine-grained video scene graph generation

Minimizing task-specific modifications and maximizing parameter sharing

Enabling generalization across different visual granularity levels

Innovation

Methods, ideas, or system contributions that make the work stand out.

Extended slot attention mechanism for object decomposition

Object temporal consistency learning without tracking modules

Dynamic triplet prediction module for evolving interactions

🔎 Similar Papers

VideoPrism: A Foundational Visual Encoder for Video Understanding

2024-02-20International Conference on Machine LearningCitations: 30

TikTok

San Jose, California

Research Engineer/Scientist (all levels), World Models

TikTok

San Jose, California

AI Research Scientist, Computer Vision - Facebook Video Intelligence