🤖 AI Summary
Current large models exhibit limited spatial reasoning because data grounded in the 3D world is scarce, while manually constructing immersive 3D environments (e.g., for VR, games, and robotic simulation) is prohibitively expensive. To address this, we propose a framework that formulates 3D scene generation as a sequential decision-making problem, employing a vision-language model (VLM) as a policy that jointly generates a scene's layout, materials, lighting, and assets. We further introduce a self-improvement fine-tuning mechanism that iteratively trains the VLM to produce more prompt-aligned environments. Our approach enables high-fidelity, scalable, and automatic 3D environment construction; the synthesized data is effective for pretraining vision foundation models, outperforming meticulously handcrafted synthetic data on downstream tasks and approaching the performance achieved with far larger amounts of real data. Key contributions include a VLM-driven end-to-end 3D generation paradigm and a self-improving training framework.
📝 Abstract
Despite large-scale pretraining endowing models with language and vision reasoning capabilities, improving their spatial reasoning remains challenging due to the lack of data grounded in the 3D world. While humans can manually create immersive, interactive worlds through 3D graphics, as seen in applications such as VR, gaming, and robotics, the process remains highly labor-intensive. In this paper, we propose a scalable method for generating high-quality 3D environments that can serve as training data for foundation models. We recast 3D environment building as a sequential decision-making problem, employing vision-language models (VLMs) as policies that output actions to jointly craft a 3D environment's layout, materials, lighting, and assets. Our proposed framework, 3D-Generalist, trains VLMs to generate more prompt-aligned 3D environments via self-improvement fine-tuning. We demonstrate the effectiveness of 3D-Generalist and the proposed training strategy in generating simulation-ready 3D environments. Furthermore, we demonstrate its quality and scalability for synthetic data generation by pretraining a vision foundation model on the generated data. After fine-tuning the pretrained model on downstream tasks, we show that it surpasses models pretrained on meticulously human-crafted synthetic data and approaches the results achieved with real data at orders-of-magnitude larger scale.
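To make the sequential decision-making formulation concrete, here is a minimal sketch of the VLM-as-policy loop: at each step the policy observes the prompt and a rendering of the current scene, then emits an editing action over layout, materials, lighting, or assets. All names here (`SceneState`, `render`, `vlm_propose_action`, `apply_action`) are hypothetical stand-ins rather than the paper's actual API, and the VLM call is stubbed so the example runs end to end.

```python
"""Minimal sketch of a VLM-as-policy scene-building loop (hypothetical API)."""
from dataclasses import dataclass, field
from typing import Any

@dataclass
class SceneState:
    layout: list = field(default_factory=list)     # object placements
    materials: dict = field(default_factory=dict)  # surface -> material
    lighting: list = field(default_factory=list)   # light sources
    assets: list = field(default_factory=list)     # retrieved asset IDs

def render(state: SceneState) -> str:
    """Stub renderer: a real system would rasterize the scene and feed
    the image back to the VLM; here we return a text description."""
    return f"{len(state.layout)} objects, {len(state.lighting)} lights"

def vlm_propose_action(prompt: str, observation: str, step: int) -> dict:
    """Stub for the VLM policy. A real implementation would send the
    prompt plus the rendered observation to a vision-language model and
    parse its reply into a structured scene-editing action."""
    script: list[dict[str, Any]] = [
        {"op": "add_asset", "asset_id": "sofa_01", "pose": (0.0, 0.0, 0.0)},
        {"op": "set_material", "surface": "floor", "material": "oak"},
        {"op": "add_light", "kind": "area", "intensity": 800},
        {"op": "stop"},
    ]
    return script[min(step, len(script) - 1)]

def apply_action(state: SceneState, action: dict) -> SceneState:
    """Apply one editing action to the scene state."""
    if action["op"] == "add_asset":
        state.layout.append({"asset": action["asset_id"], "pose": action["pose"]})
        state.assets.append(action["asset_id"])
    elif action["op"] == "set_material":
        state.materials[action["surface"]] = action["material"]
    elif action["op"] == "add_light":
        state.lighting.append({"kind": action["kind"], "intensity": action["intensity"]})
    return state

def build_scene(prompt: str, max_steps: int = 16) -> SceneState:
    """Roll out the policy until it emits a stop action or hits max_steps."""
    state = SceneState()
    for step in range(max_steps):
        action = vlm_propose_action(prompt, render(state), step)
        if action["op"] == "stop":
            break
        state = apply_action(state, action)
    return state

print(build_scene("a cozy living room"))
```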
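The self-improvement fine-tuning step can likewise be sketched as a best-of-N loop: sample several generations per prompt, score each for prompt alignment, and fine-tune the policy on its own highest-scoring outputs. Again, `generate`, `alignment_score`, and `finetune` are hypothetical stubs under those assumptions, not the paper's implementation; a real scorer might render the finished scene and rate it with a vision-language or CLIP-style model.

```python
"""Minimal sketch of self-improvement fine-tuning (hypothetical stubs)."""
import random

def generate(policy: dict, prompt: str) -> str:
    """Stub: sample one serialized action trajectory from the policy."""
    return f"trajectory<{prompt}|seed={random.randrange(1_000)}>"

def alignment_score(prompt: str, trajectory: str) -> float:
    """Stub scorer: stands in for rendering the scene and rating how
    well it matches the prompt."""
    return random.random()

def finetune(policy: dict, examples: list) -> dict:
    """Stub: supervised fine-tuning on (prompt, best trajectory) pairs."""
    return {**policy, "updates": policy.get("updates", 0) + 1}

def self_improve(policy: dict, prompts: list, rounds: int = 3,
                 samples_per_prompt: int = 4) -> dict:
    """Alternate between sampling, filtering by alignment, and fine-tuning."""
    for _ in range(rounds):
        best = []
        for p in prompts:
            candidates = [generate(policy, p) for _ in range(samples_per_prompt)]
            best.append((p, max(candidates, key=lambda t: alignment_score(p, t))))
        policy = finetune(policy, best)  # train on the policy's own best outputs
    return policy

print(self_improve({}, ["a cozy living room", "a robotics lab"]))
```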