🤖 AI Summary
Existing indoor scene generation methods struggle to achieve photorealism, fine-grained object-level control, and global style consistency at the same time. This work proposes a tri-branch collaborative generative model that combines multimodal graph conditioning with rectified flow. A multimodal graph neural network models inter-object relationships, while tightly coupled rectified flows across the layout, shape, and texture branches exchange object information and align style during generation. The approach outperforms language-conditioned and graph-conditioned baselines in photorealism, style coherence, and human preference, pairing object-level precision with scene-level stylistic unity.
📝 Abstract
Scene generation has extensive industrial applications, demanding both high realism and precise control over geometry and appearance. Language-driven retrieval methods compose plausible scenes from a large object database, but they overlook object-level control and often fail to enforce scene-level style coherence. Graph-based formulations offer greater controllability over objects and promote holistic consistency by explicitly modeling relations, yet existing methods struggle to produce high-fidelity textured results, which limits their practical utility. We present FlowScene, a tri-branch scene generative model conditioned on multimodal graphs that collaboratively generates scene layouts, object shapes, and object textures. At its core lies a tightly coupled rectified flow model that exchanges object information during generation, enabling collaborative reasoning across the graph. This design allows fine-grained control of objects' shapes, textures, and relations while enforcing scene-level style coherence across structure and appearance. Extensive experiments show that FlowScene outperforms both language-conditioned and graph-conditioned baselines in generation realism, style consistency, and alignment with human preferences.
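
The abstract names rectified flow as the generative core but gives no formulation, so the following is only a minimal sketch of the generic rectified flow recipe, not FlowScene's implementation. It assumes the common convention that noise sits at t = 0 and data at t = 1, with the straight-line path x_t = (1 - t) x_0 + t x_1 and regression target x_1 - x_0. All function names are hypothetical, the learned velocity network is replaced by an oracle, and the graph conditioning and cross-branch information exchange described above are not modeled.

```python
from typing import Callable, List

Vector = List[float]


def interpolate(x0: Vector, x1: Vector, t: float) -> Vector:
    """Point on the straight path from noise x0 (t=0) to data x1 (t=1)."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]


def target_velocity(x0: Vector, x1: Vector) -> Vector:
    """Regression target for a velocity model: the constant x1 - x0."""
    return [b - a for a, b in zip(x0, x1)]


def flow_matching_loss(pred: Vector, x0: Vector, x1: Vector) -> float:
    """Mean squared error between a predicted velocity and the target."""
    target = target_velocity(x0, x1)
    return sum((p - v) ** 2 for p, v in zip(pred, target)) / len(target)


def euler_sample(
    x0: Vector,
    velocity_fn: Callable[[Vector, float], Vector],
    steps: int = 8,
) -> Vector:
    """Integrate dx/dt = velocity_fn(x, t) from t=0 to t=1 with Euler steps."""
    x = list(x0)
    dt = 1.0 / steps
    for k in range(steps):
        v = velocity_fn(x, k * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


# Toy usage: an oracle velocity stands in for a trained network (assumption).
noise = [0.0, 0.0, 0.0]
data = [1.0, -2.0, 0.5]
oracle = lambda x, t: target_velocity(noise, data)
sample = euler_sample(noise, oracle, steps=8)
```

With the oracle velocity the path is exactly straight, so Euler integration carries `noise` to `data`. In a trained model the velocity would come from a network, which in a graph-conditioned setting would presumably also receive information from neighboring objects.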