World Craft: Agentic Framework to Create Visualizable Worlds via Text

📅 2026-01-14
🤖 AI Summary
This work addresses the difficulty that users without programming skills face in creating executable, visualizable AI simulation environments from natural language. The authors propose an agentic framework that automatically generates AI Town-like environments from user-provided textual descriptions. It comprises a structured scene-construction module (World Scaffold) and a multi-agent intent-interpretation module (World Guild). The approach introduces a structured world-generation protocol tailored to non-expert users, combines multi-agent collaborative parsing with an error-correction dataset built via reverse engineering, and uses that dataset to strengthen spatial knowledge. Experiments show that the method significantly outperforms code-centric agents such as Cursor and Antigravity, as well as large language models including Qwen3 and Gemini-3-Pro, in both scene layout stability and narrative expressiveness, offering a scalable route toward democratizing environment creation.

📝 Abstract
Large Language Models (LLMs) motivate generative agent simulation (e.g., AI Town) to create a "dynamic world", holding immense value across entertainment and research. However, non-experts, especially those without programming skills, cannot easily customize a visualizable environment by themselves. In this paper, we introduce World Craft, an agentic world creation framework that creates an executable and visualizable AI Town from users' textual descriptions. It consists of two main modules, World Scaffold and World Guild. World Scaffold is a structured and concise standard for developing interactive game scenes, serving as an efficient scaffold for LLMs to customize an executable AI Town-like environment. World Guild is a multi-agent framework that progressively analyzes users' intents from rough descriptions and synthesizes the structured content (e.g., environment layout and assets) required by World Scaffold. Moreover, we construct a high-quality error-correction dataset via reverse engineering to enhance spatial knowledge and improve the stability and controllability of layout generation, and we report multi-dimensional evaluation metrics for further analysis. Extensive experiments demonstrate that our framework significantly outperforms existing commercial code agents (Cursor and Antigravity) and LLMs (Qwen3 and Gemini-3-Pro) in scene construction and narrative intent conveyance, providing a scalable solution for the democratization of environment creation.
Problem

Research questions and friction points this paper is trying to address.

generative agent simulation
visualizable world
text-to-environment
non-expert customization
AI Town
Innovation

Methods, ideas, or system contributions that make the work stand out.

Agentic Framework
Visualizable World Generation
World Scaffold
Multi-agent Intent Analysis
Error-correction Dataset
Authors

Jianwen Sun
Software Engineering Application Technology Lab, Huawei, China
Software engineering, Deep reinforcement learning
Yukang Feng
Shanda AI Research, Tokyo
Kaining Ying
Fudan University
Chuanhao Li
Shanghai AI Laboratory
Zizhen Li
Shanda AI Research, Tokyo
Fanrui Zhang
USTC
Jiaxin Ai
Fudan University
Yifan Chang
USTC
Yu Dai
Nankai University
Yifei Huang
The University of Tokyo
egocentric vision, gaze, video understanding, embodied ai
Kaipeng Zhang
Shanghai AI Laboratory
LLM, Multimodal LLMs, AIGC