🤖 AI Summary
This work addresses a limitation of conventional scene graph generation (SGG) methods, which typically frame the task as one-shot deterministic classification and therefore do not model objects and their relationships as being generated jointly and progressively. The paper introduces flow matching to SGG for the first time, recasting the task as continuous-time transport over a hybrid discrete-continuous state space. Starting from a noised graph and guided by image-conditioned signals, the model progressively generates a complete scene graph that integrates geometric information (bounding boxes) and semantic information (object categories and predicates). The approach combines VQ-VAE quantization, a graph Transformer, and a unified continuous-discrete flow objective to enable efficient message passing and few-step inference. Evaluated on the Visual Genome and PSG benchmarks under both closed-set and open-vocabulary settings, the method reports consistent gains in predicate recall and graph-level metrics, averaging about 3 points over the prior state of the art, USG-Par.
📝 Abstract
Scene Graph Generation (SGG) unifies object localization and visual relationship reasoning by predicting boxes and subject-predicate-object triples. Yet most pipelines treat SGG as a one-shot, deterministic classification problem rather than a genuinely progressive, generative task. We propose FlowSG, which recasts SGG as continuous-time transport on a hybrid discrete-continuous state: starting from a noised graph, the model progressively grows an image-conditioned scene graph through constraint-aware refinements that jointly synthesize nodes (objects) and edges (predicates). Specifically, we first leverage a VQ-VAE to quantize scene-graph content (e.g., continuous visual features) into compact, predictable tokens; a graph Transformer then (i) predicts a conditional velocity field to transport continuous geometry (boxes) and (ii) updates discrete posteriors for categorical tokens (object features and predicate labels), coupling semantics and geometry via flow-conditioned message aggregation. Training combines flow-matching losses for geometry with a discrete-flow objective for tokens, yielding few-step inference and plug-and-play compatibility with standard detectors and segmenters. Extensive experiments on Visual Genome (VG) and PSG under closed- and open-vocabulary protocols show consistent gains in predicate recall and mean recall (R/mR) and in graph-level metrics, validating the mixed discrete-continuous generative formulation over one-shot classification baselines, with an average improvement of about 3 points over the state-of-the-art USG-Par.
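To make the hybrid formulation above more concrete, here is a minimal Python sketch of the two generic ingredients it names: a straight-line flow-matching path for continuous box coordinates, and a simple corruption path for discrete tokens. It is an illustration under stated assumptions, not FlowSG's implementation. The straight-line path, the uniform-replacement token noise, and all function names (`interpolate_box`, `target_velocity`, `corrupt_tokens`, `oracle_velocity`, `euler_sample`) are hypothetical choices made for this example. In the paper's setting a graph Transformer would predict the velocity and the token posteriors from the image; here an oracle velocity stands in for the learned network.

```python
import random


def interpolate_box(x0, x1, t):
    """Point at time t on the straight-line path from noise x0 to target box x1."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]


def target_velocity(x0, x1):
    """Constant velocity of the straight-line path; the regression target in
    this (assumed) form of flow matching."""
    return [b - a for a, b in zip(x0, x1)]


def corrupt_tokens(tokens, t, vocab_size, rng):
    """Assumed discrete path: keep each clean token with probability t,
    otherwise replace it with a uniformly random token id."""
    return [tok if rng.random() < t else rng.randrange(vocab_size) for tok in tokens]


def oracle_velocity(x1):
    """Stand-in for a learned network: the exact conditional velocity that
    points from the current state x toward the known target x1."""
    def velocity_fn(x, t):
        return [(b - a) / (1.0 - t) for a, b in zip(x, x1)]
    return velocity_fn


def euler_sample(x0, velocity_fn, num_steps):
    """Integrate dx/dt = v(x, t) from t = 0 to t = 1 with a few Euler steps."""
    x = list(x0)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt  # never reaches 1.0, so the oracle's (1 - t) stays nonzero
        v = velocity_fn(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x
```

With the oracle velocity, `euler_sample(noise_box, oracle_velocity(target_box), 4)` lands on `target_box` up to floating-point error, because the straight-line path has constant velocity. This is a toy view of why such paths can permit few-step inference; a learned velocity field would only approximate this behavior.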