NuiWorld: Exploring a Scalable Framework for End-to-End Controllable World Generation

๐Ÿ“… 2026-01-27
๐Ÿ“ˆ Citations: 0
โœจ Influential: 0
๐Ÿ“„ PDF
๐Ÿค– AI Summary
This work proposes NuiWorld, a novel framework addressing key limitations of existing end-to-end world generation methodsโ€”namely data scarcity, fixed-resolution representations, and high inference overhead. NuiWorld leverages generative bootstrapping to synthesize diverse scene data from a small set of images and incorporates pseudo-sketch labels to enable layout-controllable generation. It introduces a flattened, variable-size vector set representation that supports efficient and consistent multi-scale scene modeling. The framework substantially improves both training and inference efficiency while demonstrating strong generalization to unseen sketches. NuiWorld generates virtual worlds with high geometric fidelity and precise layout control, advancing the state of the art in controllable 3D scene synthesis.
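The generative bootstrapping described above can be pictured as a simple data-synthesis loop: lift a seed image to 3D, expand it to a scene of some target size, and derive a pseudo-sketch label for supervision. The Python sketch below is a hypothetical illustration only; every name in it (Scene, reconstruct_3d, expand_scene, render_pseudo_sketch) is a placeholder assumption, not the authors' actual code or API.

```python
"""Minimal sketch of a generative-bootstrapping data pipeline, assuming
off-the-shelf 3D reconstruction and expandable scene generation. All
names and data structures are illustrative placeholders."""
from dataclasses import dataclass
import random

@dataclass
class Scene:
    chunks: list   # placeholder for 3D scene chunks
    size: str      # "S", "M", or "L"

def reconstruct_3d(image) -> Scene:
    # Stand-in for a 3D reconstruction model that lifts one image to a scene.
    return Scene(chunks=[image], size="S")

def expand_scene(scene: Scene, size: str) -> Scene:
    # Stand-in for expandable scene generation (growing the scene outward).
    factor = {"S": 1, "M": 4, "L": 16}[size]
    return Scene(chunks=scene.chunks * factor, size=size)

def render_pseudo_sketch(scene: Scene) -> str:
    # Stand-in for projecting the scene into a layout sketch label.
    return f"sketch_of_{len(scene.chunks)}_chunks"

def bootstrap_dataset(seed_images, num_scenes=1000):
    """Synthesize (pseudo-sketch, scene) training pairs from a few images."""
    pairs = []
    for _ in range(num_scenes):
        scene = reconstruct_3d(random.choice(seed_images))
        scene = expand_scene(scene, random.choice(["S", "M", "L"]))
        pairs.append((render_pseudo_sketch(scene), scene))
    return pairs
```

Under these assumptions, a handful of seed images yields arbitrarily many paired examples of varying size and layout, which is what makes end-to-end training feasible despite the initial data scarcity.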

๐Ÿ“ Abstract
World generation is a fundamental capability for applications like video games, simulation, and robotics. However, existing approaches face three main obstacles: controllability, scalability, and efficiency. End-to-end scene generation models have been limited by data scarcity, while object-centric generation approaches rely on fixed-resolution representations that degrade fidelity for larger scenes. Training-free approaches, while flexible, are often slow and computationally expensive at inference time. We present NuiWorld, a framework that attempts to address these challenges. To overcome data scarcity, we propose a generative bootstrapping strategy that starts from a few input images. Leveraging recent 3D reconstruction and expandable scene generation techniques, we synthesize scenes of varying sizes and layouts, producing enough data to train an end-to-end model. Furthermore, our framework enables controllability through pseudo-sketch labels and demonstrates a degree of generalization to previously unseen sketches. Our approach represents scenes as a collection of variable-size scene chunks, which are compressed into a flattened vector-set representation. This significantly reduces the token length for large scenes, enabling consistent geometric fidelity across scene sizes while improving training and inference efficiency.
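The key property of the chunked vector-set representation is that the token count scales with the number of chunks rather than with voxel resolution. The sketch below illustrates that idea with assumed shapes and a toy random-projection encoder (encode_chunk, LATENTS_PER_CHUNK, LATENT_DIM are all hypothetical); in the actual framework this compression is learned, not hand-coded.

```python
"""Illustrative sketch of a chunked, flattened vector-set representation:
each scene chunk is compressed to a small fixed set of latent vectors,
and the per-chunk sets are concatenated into one token sequence. Shapes
and the encoder are assumptions for illustration only."""
import numpy as np

LATENTS_PER_CHUNK = 64   # assumed: fixed-size vector set per chunk
LATENT_DIM = 32          # assumed latent width

def encode_chunk(chunk_voxels: np.ndarray) -> np.ndarray:
    # Toy stand-in encoder: random projection plus uniform subsampling.
    rng = np.random.default_rng(0)
    proj = rng.standard_normal((chunk_voxels.size, LATENT_DIM))
    pooled = chunk_voxels.reshape(-1, 1) * proj                # (V, D)
    idx = np.linspace(0, len(pooled) - 1, LATENTS_PER_CHUNK).astype(int)
    return pooled[idx]                                         # (64, D)

def flatten_scene(chunks: list) -> np.ndarray:
    # Flatten per-chunk vector sets into a single token sequence.
    return np.concatenate([encode_chunk(c) for c in chunks], axis=0)

# Token count grows with the number of chunks, not with voxel resolution:
chunks = [np.ones((16, 16, 16)) for _ in range(9)]   # a 3x3-chunk scene
tokens = flatten_scene(chunks)
print(tokens.shape)   # (576, 32), versus 9 * 16**3 = 36864 dense voxels
```

Under these toy assumptions, the nine-chunk scene is represented by 576 tokens instead of 36,864 dense voxels, and enlarging the scene grows the sequence linearly while per-chunk fidelity stays fixed, which is consistent with the efficiency and scale-consistency claims above.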
Problem

Research questions and friction points this paper is trying to address.

controllability
scalability
efficiency
data scarcity
scene generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

generative bootstrapping
scene chunks
vector-set representation
controllable generation
scalable world generation
๐Ÿ”Ž Similar Papers
H
Han-Hung Lee
Simon Fraser University
C
Cheng-Yu Yang
National Yang Ming Chiao Tung University
Yu-Lun Liu
Yu-Lun Liu
Assistant Professor, National Yang Ming Chiao Tung University
Computer VisionImage ProcessingMachine LearningDeep LearningComputational Photography
A
Angel X. Chang
Simon Fraser University, CIFAR AI Chair, Amii