Joint Generative Modeling of Scene Graphs and Images via Diffusion Models

📅 2024-01-02
🏛️ arXiv.org
📈 Citations: 3
Influential: 0

🤖 AI Summary
This paper introduces a novel generative task, joint scene graph and image generation, in which structured scene graphs (object categories, bounding boxes, and relations among objects) are generated unconditionally from noise and then provide efficient, interpretable control for image generation. The authors propose DiffuseSG, a diffusion model with a graph transformer denoiser that jointly models the adjacency matrix together with node attributes (categories and bounding boxes) and edge attributes (relations). Discrete category labels are relaxed into a continuous space using several candidate encodings, the scene graph representation is denoised in that continuous space, and the final representation is discretized to produce the clean scene graph; an IoU regularization further improves empirical performance. On Visual Genome and COCO-Stuff, DiffuseSG significantly outperforms existing methods in scene graph generation on both standard and newly introduced metrics. It also performs well on a series of scene graph completion tasks, and extra training samples generated by DiffuseSG improve scene graph detection models.

📝 Abstract
In this paper, we present a novel generative task: joint scene graph - image generation. While previous works have explored image generation conditioned on scene graphs or layouts, our task is distinctive and important as it involves generating scene graphs themselves unconditionally from noise, enabling efficient and interpretable control for image generation. Our task is challenging, requiring the generation of plausible scene graphs with heterogeneous attributes for nodes (objects) and edges (relations among objects), including continuous object bounding boxes and discrete object and relation categories. We introduce a novel diffusion model, DiffuseSG, that jointly models the adjacency matrix along with heterogeneous node and edge attributes. We explore various types of encodings for the categorical data, relaxing it into a continuous space. With a graph transformer being the denoiser, DiffuseSG successively denoises the scene graph representation in a continuous space and discretizes the final representation to generate the clean scene graph. Additionally, we introduce an IoU regularization to enhance the empirical performance. Our model significantly outperforms existing methods in scene graph generation on the Visual Genome and COCO-Stuff datasets, both on standard and newly introduced metrics that better capture the problem complexity. Moreover, we demonstrate the additional benefits of our model in two downstream applications: 1) excelling in a series of scene graph completion tasks, and 2) improving scene graph detection models by using extra training samples generated from DiffuseSG.
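
The abstract describes relaxing categorical data into a continuous space and discretizing the final denoised representation. The sketch below illustrates that round trip with two common encodings, one-hot vectors and sign-valued binary bits. It is only an illustration: the abstract does not say which encodings DiffuseSG uses, and the function names and the fixed perturbation standing in for residual denoising error are assumptions.

```python
def encode_one_hot(label, num_classes):
    """Relax an integer category into a real-valued one-hot vector."""
    return [1.0 if k == label else 0.0 for k in range(num_classes)]


def decode_one_hot(vec):
    """Discretize by taking the index of the largest entry (argmax)."""
    return max(range(len(vec)), key=lambda k: vec[k])


def encode_bits(label, num_bits):
    """Relax an integer category into +1.0/-1.0 values, least significant bit first."""
    return [1.0 if (label >> i) & 1 else -1.0 for i in range(num_bits)]


def decode_bits(vec):
    """Discretize by thresholding each entry at zero and reassembling the integer."""
    return sum(1 << i for i, v in enumerate(vec) if v > 0.0)


# Hypothetical example: category 5 among at most 16 classes (4 bits).
label = 5
clean = encode_bits(label, 4)

# A fixed, small perturbation stands in for imperfect denoising (assumption).
perturbed = [v + d for v, d in zip(clean, [0.4, -0.3, 0.2, -0.5])]
recovered = decode_bits(perturbed)
```

The bit encoding needs only about log2(C) dimensions for C classes, while one-hot needs C; both recover the original label as long as the continuous values stay on the correct side of the decision boundary.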

Problem

Research questions and friction points this paper is trying to address.

Generating grounded scene graphs from noise for interpretable control
Modeling heterogeneous node and edge attributes in scene graphs
Improving performance in scene graph generation and downstream tasks
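
The heterogeneous attributes named above (continuous boxes, discrete object and relation categories, and graph connectivity) can be pictured with a small container type. This is a minimal sketch for orientation, not the paper's data format: the class names, the normalized (x1, y1, x2, y2) box convention, and the example labels are all assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    category: str   # discrete object category
    box: tuple      # continuous box, assumed (x1, y1, x2, y2) in [0, 1]


@dataclass
class SceneGraph:
    nodes: list = field(default_factory=list)
    # Directed edges: (subject index, object index) -> discrete relation label.
    edges: dict = field(default_factory=dict)

    def adjacency(self):
        """Return the binary adjacency matrix implied by the edge set."""
        n = len(self.nodes)
        adj = [[0] * n for _ in range(n)]
        for subj, obj in self.edges:
            adj[subj][obj] = 1
        return adj


# Hypothetical two-object scene: "person riding horse".
graph = SceneGraph(
    nodes=[
        Node("person", (0.40, 0.10, 0.60, 0.55)),
        Node("horse", (0.25, 0.35, 0.80, 0.95)),
    ],
    edges={(0, 1): "riding"},
)
adj = graph.adjacency()
```

A generative model for such graphs has to produce all three parts consistently: which edges exist, what label each edge carries, and where each object sits.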

Innovation

Methods, ideas, or system contributions that make the work stand out.

Diffusion model for joint scene graph generation
Graph transformer denoiser for continuous refinement
IoU regularization to enhance performance
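
For readers unfamiliar with the quantity behind the IoU regularization, the sketch below computes the standard intersection-over-union of two axis-aligned boxes. How DiffuseSG turns IoU into a regularization term is not stated in this summary, so the code shows only the metric itself; the function name and the (x1, y1, x2, y2) box format are assumptions.

```python
def box_iou(a, b):
    """Intersection-over-union of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b

    # Width and height of the overlap region, clamped at zero when disjoint.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h

    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - inter

    # Degenerate (zero-area) inputs are treated as having no overlap.
    return inter / union if union > 0.0 else 0.0


overlap = box_iou((0.0, 0.0, 2.0, 2.0), (1.0, 1.0, 3.0, 3.0))
```

A term such as 1 - IoU between predicted and reference boxes is a common way to penalize poorly aligned boxes, though whether the paper uses that exact form is not specified here.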