Can We Build Scene Graphs, Not Classify Them? FlowSG: Progressive Image-Conditioned Scene Graph Generation with Flow Matching

📅 2026-04-17

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses a limitation of conventional scene graph generation methods: they typically frame the task as one-shot deterministic classification and therefore do not model the progressive, joint generation of objects and their relationships. The paper introduces flow matching to this domain for the first time, reformulating the task as continuous-time transport in a hybrid discrete-continuous state space. Starting from a noisy graph and guided by image-conditioned signals, the model progressively generates a complete scene graph that integrates geometric information (bounding boxes) with semantic information (object categories and predicates). The approach combines VQ-VAE quantization, a graph Transformer, and a unified continuous-discrete flow objective to enable efficient message passing and few-step inference. Evaluated on the Visual Genome (VG) and PSG benchmarks under both closed-set and open-vocabulary settings, the method shows consistent gains in predicate recall (R/mR) and graph-level metrics, with an average improvement of about 3 points over the previous state of the art, USG-Par.

📝 Abstract

Scene Graph Generation (SGG) unifies object localization and visual relationship reasoning by predicting boxes and subject-predicate-object triples. Yet most pipelines treat SGG as a one-shot, deterministic classification problem rather than a genuinely progressive, generative task. We propose FlowSG, which recasts SGG as continuous-time transport on a hybrid discrete-continuous state: starting from a noised graph, the model progressively grows an image-conditioned scene graph through constraint-aware refinements that jointly synthesize nodes (objects) and edges (predicates). Specifically, we first leverage a VQ-VAE to quantize a scene graph (e.g., continuous visual features) into compact, predictable tokens; a graph Transformer then (i) predicts a conditional velocity field to transport continuous geometry (boxes) and (ii) updates discrete posteriors for categorical tokens (object features and predicate labels), coupling semantics and geometry via flow-conditioned message aggregation. Training combines flow-matching losses for geometry with a discrete-flow objective for tokens, yielding few-step inference and plug-and-play compatibility with standard detectors and segmenters. Extensive experiments on VG and PSG under closed- and open-vocabulary protocols show consistent gains in predicate R/mR and graph-level metrics, validating the mixed discrete-continuous generative formulation over one-shot classification baselines, with an average improvement of about 3 points over the state-of-the-art USG-Par.
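To make the continuous half of this formulation concrete, the following is a minimal NumPy sketch of generic flow matching applied to box coordinates. It is not the paper's implementation: the straight-line interpolation path, the function names (`interpolate`, `flow_matching_loss`, `euler_sample`), the toy boxes, and the closed-form `oracle_velocity` that stands in for the image-conditioned graph Transformer are all assumptions made for illustration. The discrete-token objective is omitted.

```python
import numpy as np


def interpolate(x0, x1, t):
    """Point at time t in [0, 1] on the straight path from noise x0 to data x1."""
    return (1.0 - t) * x0 + t * x1


def flow_matching_loss(velocity_fn, x0, x1, t):
    """MSE between the predicted velocity and the straight-path target x1 - x0."""
    xt = interpolate(x0, x1, t)
    target = x1 - x0
    pred = velocity_fn(xt, t)
    return float(np.mean((pred - target) ** 2))


def euler_sample(velocity_fn, x0, num_steps):
    """Integrate dx/dt = v(x, t) from t = 0 to t = 1 with fixed-size Euler steps."""
    x = x0.copy()
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt  # largest t used is 1 - dt, so t never reaches 1
        x = x + dt * velocity_fn(x, t)
    return x


# Toy data (assumed): two ground-truth boxes in normalized (x1, y1, x2, y2) form.
boxes_gt = np.array([[0.1, 0.2, 0.5, 0.6],
                     [0.4, 0.4, 0.9, 0.8]])
rng = np.random.default_rng(0)
noise = rng.standard_normal(boxes_gt.shape)


def oracle_velocity(x, t):
    """Hypothetical stand-in for a trained network: on the straight path,
    (x1 - x_t) / (1 - t) equals the target velocity x1 - x0 for t < 1."""
    return (boxes_gt - x) / (1.0 - t)


loss = flow_matching_loss(oracle_velocity, noise, boxes_gt, t=0.3)
boxes_pred = euler_sample(oracle_velocity, noise, num_steps=4)
```

With a learned velocity field in place of the oracle, the same Euler loop is what makes few-step inference possible: the number of network evaluations equals `num_steps`. In this sketch the oracle is exact on the straight path, so the loss is near zero and four steps land on the target boxes up to floating-point error.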

Problem

Research questions and friction points this paper is trying to address.

Scene Graph Generation

Generative Modeling

Discrete-Continuous Representation

Visual Relationship Reasoning

Progressive Generation

Innovation

Methods, ideas, or system contributions that make the work stand out.

Flow Matching

Scene Graph Generation

Discrete-Continuous Generative Modeling