🤖 AI Summary
Generative AI has long struggled with precise spatial composition control in image generation, particularly in enabling fine-grained, interactive manipulation of object layout and scene conditions. To address this, we propose a virtual canvas framework grounded in a real-time 3D engine: it parses textual inputs into manipulable 3D object instances, enabling intuitive user interaction—including drag-and-drop, scaling, and constraint specification—and translates spatial intent into structured geometric constraints that guide diffusion-based image synthesis. Our key contribution is the first integration of a real-time 3D interactive engine as a human–AI co-design medium for spatial modeling, unifying natural language understanding, 3D instantiation, spatial relation reasoning, and controllable image generation. Experiments demonstrate significant improvements over state-of-the-art baselines: +32.7% IoU in spatial accuracy, 41% reduction in task completion time, and higher user satisfaction—validated through open-ended real-world usability testing.
📝 Abstract
Generative AI (GenAI) has significantly advanced the ease and flexibility of image creation. However, it remains a challenge to precisely control spatial compositions, including object arrangement and scene conditions. To bridge this gap, we propose Canvas3D, an interactive system leveraging a 3D engine to enable precise spatial manipulation for image generation. Upon user prompt, Canvas3D automatically converts textual descriptions into interactive objects within a 3D engine-driven virtual canvas, empowering direct and precise spatial configuration. These user-defined arrangements generate explicit spatial constraints that guide generative models in accurately reflecting user intentions in the resulting images. We conducted a closed-end comparative study between Canvas3D and a baseline system. And an open-ended study to evaluate our system "in the wild". The result indicates that Canvas3D outperforms the baseline on spatial control, interactivity, and overall user experience.