Dependency-Aware Discrete Diffusion for Scene Graph Generation

📅 2026-05-09
📈 Citations: 0
Influential: 0

🤖 AI Summary
This work addresses the problem of generating scene graphs, which have hierarchical structure and strong dependencies, from natural language. It proposes a dependency-aware, hierarchically constrained discrete diffusion model that decouples structure from semantics in both the forward and reverse processes, so that conditional dependencies among objects, edges, and relations can be captured. At inference time, training-free conditioning yields text-aligned scene graphs without additional training. On standard scene graph benchmarks, the model improves over continuous and discrete graph generation baselines on both graph and layout metrics. When the generated graphs are used to condition downstream image generation, compositional alignment improves relative to text-to-image models, particularly in multi-object scenes.
📝 Abstract
Scene graphs (SGs) represent objects and their relationships as structured graphs, enabling applications in image generation, robotics, and 3D understanding. Recent work suggests that conditioning image generation on scene graphs improves compositional fidelity compared to text-only prompting. However, since users typically provide text rather than structured graphs, a key challenge is to generate scene graphs from natural language. Prior work on discrete diffusion has demonstrated success in generating generic graphs such as molecules and circuits, but fails to account for the hierarchical structure and strong dependencies between objects, edges, and relations in scene graphs. We address this limitation by introducing a dependency-aware, hierarchically constrained discrete diffusion model for scene graph generation. Our approach decouples structure and semantics across the forward and reverse processes, enabling the model to capture conditional dependencies. At inference time, we perform training-free conditioning to sample text-aligned scene graphs. We evaluate our method on standard SG benchmarks and demonstrate improvements over both continuous and discrete graph generation baselines across graph and layout metrics. When the generated scene graphs are fed to downstream image generation, our approach yields improved compositional alignment compared to text-to-image models, particularly in multi-object scenarios.
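To make the object/edge/relation hierarchy described in the abstract concrete, here is a minimal sketch in Python. It is not the paper's method: the `SceneGraph` container, the `MASK` absorbing state, the `is_consistent` check, and the `corrupt` step are all hypothetical names chosen for illustration, and the noising rule (keep the edge structure fixed, mask object and relation labels, and force a relation to be masked whenever one of its endpoint objects is masked) is only one assumed way a forward process could respect such dependencies.

```python
import random
from dataclasses import dataclass

MASK = -1  # hypothetical absorbing "mask" state for labels (an assumption)


@dataclass
class SceneGraph:
    objects: list    # objects[i]: category id, or MASK
    edges: list      # edges[i][j]: 1 if a directed edge i -> j exists, else 0
    relations: list  # relations[i][j]: predicate id or MASK on an edge, None otherwise


def is_consistent(g):
    """Check the hierarchy: a relation label exists exactly where an edge exists."""
    n = len(g.objects)
    for i in range(n):
        for j in range(n):
            has_edge = g.edges[i][j] == 1
            has_rel = g.relations[i][j] is not None
            if has_edge != has_rel or (has_edge and i == j):
                return False
    return True


def corrupt(g, p, rng):
    """One illustrative noising step that leaves structure (edges) untouched.

    Each object label is masked with probability p. A relation label is masked
    with probability p, and always masked if either endpoint object is masked.
    """
    n = len(g.objects)
    objects = [MASK if rng.random() < p else o for o in g.objects]
    edges = [row[:] for row in g.edges]
    relations = [row[:] for row in g.relations]
    for i in range(n):
        for j in range(n):
            if edges[i][j] != 1:
                continue  # no edge, so there is no relation label to corrupt
            endpoint_masked = objects[i] == MASK or objects[j] == MASK
            if endpoint_masked or rng.random() < p:
                relations[i][j] = MASK
    return SceneGraph(objects, edges, relations)


# Toy graph: person(0) -riding(0)-> horse(1) -on(1)-> field(2); ids are made up.
example = SceneGraph(
    objects=[0, 1, 2],
    edges=[[0, 1, 0], [0, 0, 1], [0, 0, 0]],
    relations=[[None, 0, None], [None, None, 1], [None, None, None]],
)
noisy = corrupt(example, p=0.5, rng=random.Random(0))
print(noisy.objects, is_consistent(noisy))
```

Because relation labels are only ever touched where an edge exists, any graph produced by `corrupt` passes `is_consistent` whenever its input does. A reverse process would need the same property, which is one way to read the abstract's "hierarchically constrained" requirement.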
Problem

Research questions and friction points this paper is trying to address.

scene graph generation
natural language to structured representation
hierarchical dependencies
discrete diffusion
compositional scene understanding
Innovation

Methods, ideas, or system contributions that make the work stand out.

dependency-aware diffusion
scene graph generation
discrete diffusion
hierarchical constraints
text-to-graph alignment