LatticeWorld: A Multimodal Large Language Model-Empowered Framework for Interactive Complex World Generation

📅 2025-09-05
📈 Citations: 0
Influential: 0
🤖 AI Summary
Traditional manual 3D modeling is inefficient and insufficient for constructing large-scale, dynamic, interactive 3D worlds. Method: This paper proposes a lightweight multimodal framework that tightly integrates LLaMA-2-7B with Unreal Engine 5’s rendering pipeline, enabling end-to-end 3D scene generation from text or visual instructions. A lightweight LLM parses multimodal inputs and coordinates with a physics engine to achieve high-fidelity dynamic simulation, while UE5 delivers real-time rendering and agent interaction. Contribution/Results: We introduce the first scalable, interactive, and dynamically evolving multimodal 3D generation paradigm. Our approach significantly outperforms baselines in layout accuracy and visual fidelity, improves production efficiency by over 90×, and maintains high creative controllability. The framework is applicable to critical domains including embodied AI and autonomous driving simulation.

📝 Abstract
Recent research has been increasingly focusing on developing 3D world models that simulate complex real-world scenarios. World models have found broad applications across various domains, including embodied AI, autonomous driving, entertainment, etc. A more realistic simulation with accurate physics will effectively narrow the sim-to-real gap and allow us to gather rich information about the real world conveniently. While traditional manual modeling has enabled the creation of virtual 3D scenes, modern approaches have leveraged advanced machine learning algorithms for 3D world generation, with most recent advances focusing on generative methods that can create virtual worlds based on user instructions. This work explores such a research direction by proposing LatticeWorld, a simple yet effective 3D world generation framework that streamlines the industrial production pipeline of 3D environments. LatticeWorld leverages lightweight LLMs (LLaMA-2-7B) alongside an industry-grade rendering engine (e.g., Unreal Engine 5) to generate a dynamic environment. Our proposed framework accepts textual descriptions and visual instructions as multimodal inputs and creates large-scale 3D interactive worlds with dynamic agents, featuring competitive multi-agent interaction, high-fidelity physics simulation, and real-time rendering. We conduct comprehensive experiments to evaluate LatticeWorld, showing that it achieves superior accuracy in scene layout generation and visual fidelity. Moreover, LatticeWorld achieves over a $90\times$ increase in industrial production efficiency while maintaining high creative quality compared with traditional manual production methods. Our demo video is available at https://youtu.be/8VWZXpERR18
Problem

Research questions and friction points this paper is trying to address.

Generating interactive 3D worlds with multimodal inputs
Streamlining industrial 3D environment production pipeline
Enhancing simulation realism with physics and interactions
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multimodal LLM and Unreal Engine integration
Generates interactive 3D worlds from text and visual inputs
Achieves 90x efficiency gain over manual production
Yinglin Duan
NetEase
video game, computer vision, dance, machine learning
Zhengxia Zou
Beihang University
computer vision, image processing, remote sensing, games
Tongwei Gu
NetEase, Inc., China
Wei Jia
NetEase, Inc., China
Zhan Zhao
NetEase, Inc., China
Luyi Xu
Work done while at NetEase, Inc., China
Xinzhu Liu
Tsinghua University, China
Hao Jiang
NetEase, Inc., China
Kang Chen
NetEase, Inc., China
Shuang Qiu
City University of Hong Kong
Reinforcement Learning, Agentic AI, Large Language Models, Embodied AI