Synthetic Visual Genome 2: Extracting Large-scale Spatio-Temporal Scene Graphs from Videos

📅 2026-02-26
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the scarcity of large-scale, diverse spatio-temporal scene graph datasets for video understanding by introducing SVG2, a dataset of 636K videos built with a fully automatic synthesis pipeline that combines multi-scale panoptic segmentation, online-offline trajectory tracking, and vision-language-model-based relation inference. On this data the authors train TRaSER, an end-to-end video scene graph generation model featuring a trajectory-aligned token ordering scheme and a dual resampling mechanism (an object-trajectory resampler and a temporal-window resampler). Experimental results show significant gains across multiple benchmarks: relation detection improves by 15–20%, object prediction by 30–40%, and attribute prediction by 15%. When the generated scene graphs are used for video question answering, the approach yields a further accuracy improvement of 1.5–4.6%.
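To make the "spatio-temporal scene graph" representation concrete: each object is a trajectory (a category plus attributes and per-frame boxes), and each relation is a predicate between two trajectories that holds over a frame interval. The sketch below is a minimal, hypothetical data structure for such a graph; it is not the paper's actual schema, just an illustration of the kind of output the pipeline and TRaSER produce.

```python
from dataclasses import dataclass, field

@dataclass
class ObjectTrack:
    # one object tracked across frames: frame index -> box (x, y, w, h)
    track_id: int
    category: str
    attributes: list[str] = field(default_factory=list)
    boxes: dict[int, tuple[float, float, float, float]] = field(default_factory=dict)

@dataclass
class Relation:
    # a predicate holding between two tracks over a frame interval
    subject_id: int
    predicate: str
    object_id: int
    start_frame: int
    end_frame: int

@dataclass
class SpatioTemporalSceneGraph:
    tracks: dict[int, ObjectTrack] = field(default_factory=dict)
    relations: list[Relation] = field(default_factory=list)

    def relations_at(self, frame: int) -> list[Relation]:
        # relations active at a given frame
        return [r for r in self.relations if r.start_frame <= frame <= r.end_frame]

# build a tiny two-object graph
g = SpatioTemporalSceneGraph()
g.tracks[0] = ObjectTrack(0, "person", ["standing"], {0: (10, 10, 40, 80)})
g.tracks[1] = ObjectTrack(1, "dog", ["brown"], {0: (60, 50, 30, 25)})
g.relations.append(Relation(0, "next to", 1, start_frame=0, end_frame=30))
print([r.predicate for r in g.relations_at(15)])  # → ['next to']
```

Keying relations on frame intervals rather than single frames is what lets the graph capture temporal semantics such as relations starting or ending mid-video.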

📝 Abstract
We introduce Synthetic Visual Genome 2 (SVG2), a large-scale panoptic video scene graph dataset. SVG2 contains over 636K videos with 6.6M objects, 52.0M attributes, and 6.7M relations, providing an order-of-magnitude increase in scale and diversity over prior spatio-temporal scene graph datasets. To create SVG2, we design a fully automated pipeline that combines multi-scale panoptic segmentation, online-offline trajectory tracking with automatic new-object discovery, per-trajectory semantic parsing, and GPT-5-based spatio-temporal relation inference. Building on this resource, we train TRaSER, a video scene graph generation model. TRaSER augments VLMs with a trajectory-aligned token arrangement mechanism and two new modules, an object-trajectory resampler and a temporal-window resampler, to convert raw videos and panoptic trajectories into compact spatio-temporal scene graphs in a single forward pass. The temporal-window resampler binds visual tokens to short trajectory segments to preserve local motion and temporal semantics, while the object-trajectory resampler aggregates entire trajectories to maintain global context for objects. On the PVSG, VIPSeg, VidOR, and SVG2 test sets, TRaSER improves relation detection by 15–20%, object prediction by 30–40% over the strongest open-source baselines and by 13% over GPT-5, and attribute prediction by 15%. When TRaSER's generated scene graphs are passed to a VLM for video question answering, they deliver a 1.5–4.6% absolute accuracy gain over using video only or video augmented with Qwen2.5-VL's generated scene graphs, demonstrating the utility of explicit spatio-temporal scene graphs as an intermediate representation.
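The abstract does not give implementation details for the two resamplers; a Perceiver-style cross-attention resampler, where a fixed set of learned latent queries attends over a variable-length set of visual tokens, is one common way to realize this kind of token compression. The NumPy sketch below (all names and shapes hypothetical) illustrates the difference in granularity the abstract describes: the temporal-window resampler compresses each short window of tokens separately, while the object-trajectory resampler compresses all tokens along one trajectory at once.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention_resample(queries, tokens, w_q, w_k, w_v):
    # queries: (n_latent, d) learned latents; tokens: (n_tokens, d) visual tokens
    q = queries @ w_q
    k = tokens @ w_k
    v = tokens @ w_v
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    return attn @ v  # fixed-size summary: (n_latent, d)

rng = np.random.default_rng(0)
d, n_latent = 16, 4
w_q, w_k, w_v = (rng.standard_normal((d, d)) * 0.1 for _ in range(3))
latents = rng.standard_normal((n_latent, d))

# temporal-window resampling: one compact summary per short window of frames,
# preserving local motion cues window by window
video_tokens = rng.standard_normal((8, 10, d))  # 8 windows x 10 tokens each
window_summaries = np.stack(
    [cross_attention_resample(latents, win, w_q, w_k, w_v) for win in video_tokens]
)

# object-trajectory resampling: one summary over an entire (variable-length)
# trajectory, preserving global context for the object
trajectory_tokens = rng.standard_normal((37, d))
object_summary = cross_attention_resample(latents, trajectory_tokens, w_q, w_k, w_v)

print(window_summaries.shape, object_summary.shape)  # (8, 4, 16) (4, 16)
```

Either way the output has a fixed token budget per object or per window, which is what allows a VLM to consume long videos and many trajectories in a single forward pass.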
Problem

Research questions and friction points this paper is trying to address.

spatio-temporal scene graph
video understanding
large-scale dataset
object relations
panoptic video
Innovation

Methods, ideas, or system contributions that make the work stand out.

spatio-temporal scene graph
panoptic video understanding
trajectory-aligned tokenization
automated dataset synthesis
video scene graph generation