🤖 AI Summary
Problem: Text-to-image models struggle to generate escape-room puzzle images that are simultaneously visually compelling, logically coherent, and intellectually challenging.
Method: We propose the first hierarchical multi-agent framework specifically designed for escape-room puzzle generation. It decomposes the task into four sequential stages: functional design, symbolic scene-graph reasoning, layout synthesis, and localized iterative editing. The framework combines symbolic AI (for explicit modeling of spatial and functional relationships), multi-agent collaborative optimization, feedback-driven image editing, and joint visual–logical evaluation.
Results: Experiments show that agent collaboration improves puzzle solvability, shortcut avoidance, and the clarity of object affordances, while maintaining visual quality. The framework points toward text-to-image generation oriented to embodied reasoning, moving beyond aesthetic or semantic fidelity toward structured, interactive, and logically grounded visual synthesis.
📝 Abstract
We challenge text-to-image models with generating escape room puzzle images that are visually appealing, logically sound, and intellectually stimulating. Because base image models struggle with spatial relationships and affordance reasoning, we propose a hierarchical multi-agent framework that decomposes this task into structured stages: functional design, symbolic scene graph reasoning, layout synthesis, and local image editing. Specialized agents collaborate through iterative feedback to ensure the scene is visually coherent and functionally solvable. Experiments show that agent collaboration improves output quality in terms of solvability, shortcut avoidance, and affordance clarity, while maintaining visual quality.
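To make the ideas of solvability and shortcut avoidance concrete, here is a minimal Python sketch of how a symbolic scene graph could be checked. It is not the authors' implementation: the `PuzzleObject` structure, the `requires`/`yields` fields, and both check functions are hypothetical names chosen for illustration, and the toy rooms are invented. The sketch uses simple forward chaining: an object can be opened once every item it requires has been collected.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PuzzleObject:
    """One node in a hypothetical symbolic scene graph (illustrative only)."""
    name: str
    requires: frozenset = frozenset()  # items needed to open or use this object
    yields: frozenset = frozenset()    # items obtained once it is opened


def reachable_items(objects):
    """Forward-chain: keep opening objects whose requirements are already met."""
    items, opened = set(), set()
    changed = True
    while changed:
        changed = False
        for obj in objects:
            if obj.name not in opened and obj.requires <= items:
                opened.add(obj.name)
                items |= obj.yields
                changed = True
    return items


def is_solvable(objects, goal):
    """The room is solvable if the goal item can eventually be collected."""
    return goal in reachable_items(objects)


def bypassable_objects(objects, goal):
    """Names of objects whose removal still leaves the goal reachable."""
    return [
        obj.name
        for obj in objects
        if is_solvable([o for o in objects if o is not obj], goal)
    ]


# Intended chain: drawer -> key -> chest -> code -> door -> exit.
room = [
    PuzzleObject("drawer", yields=frozenset({"key"})),
    PuzzleObject("chest", requires=frozenset({"key"}), yields=frozenset({"code"})),
    PuzzleObject("door", requires=frozenset({"code"}), yields=frozenset({"exit"})),
]

# Same room, but the code is also written on the wall: an unintended shortcut.
leaky_room = room + [PuzzleObject("wall_note", yields=frozenset({"code"}))]

print(is_solvable(room, "exit"))               # True
print(bypassable_objects(room, "exit"))        # []
print(bypassable_objects(leaky_room, "exit"))  # ['drawer', 'chest', 'wall_note']
```

In the second room every object except the door can be skipped, which is the kind of signal a feedback loop could pass back to a design or editing agent. A real system would also have to verify that the rendered image matches the graph, which this sketch does not attempt.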