SceneAssistant: A Visual Feedback Agent for Open-Vocabulary 3D Scene Generation

📅 2026-03-12
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing text-to-3D scene generation methods are often constrained by domain-specific assumptions or predefined spatial relationships, limiting their ability to support open-vocabulary, unconstrained scene synthesis. This work proposes a vision-feedback-driven agent framework that, for the first time, integrates the spatial reasoning capabilities of vision-language models (VLMs) with 3D object generation models. By leveraging atomic operations—such as scaling, rotation, and focus—and a render-and-refine feedback loop, the framework enables iterative, natural language–guided 3D scene generation and editing. The approach accepts open-vocabulary inputs and produces semantically aligned, high-fidelity, and diverse 3D scenes, significantly outperforming existing methods in both human evaluations and qualitative analysis.
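
To make the atomic-operation idea concrete, here is a minimal Python sketch of a scene whose objects the agent can adjust through named operations such as scale, rotate, and focus. All class, field, and method names below are illustrative assumptions, not the paper's actual API.

    from dataclasses import dataclass, field

    @dataclass
    class SceneObject:
        # One placed asset: a name plus a simple transform (illustrative fields only).
        name: str
        position: tuple = (0.0, 0.0, 0.0)
        rotation_deg: float = 0.0   # yaw around the vertical axis
        scale: float = 1.0

    @dataclass
    class Scene:
        objects: dict = field(default_factory=dict)

        # Atomic operations the agent can invoke by name (hypothetical signatures).
        def scale(self, name, factor):
            self.objects[name].scale *= factor

        def rotate(self, name, degrees):
            obj = self.objects[name]
            obj.rotation_deg = (obj.rotation_deg + degrees) % 360

        def focus_on(self, name):
            # Tell the renderer which object to frame in the next feedback image.
            return self.objects[name]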

📝 Abstract
Text-to-3D scene generation from natural language is highly desirable for digital content creation. However, existing methods are largely domain-restricted or reliant on predefined spatial relationships, limiting their capacity for unconstrained, open-vocabulary 3D scene synthesis. In this paper, we introduce SceneAssistant, a visual-feedback-driven agent designed for open-vocabulary 3D scene generation. Our framework leverages a modern 3D object generation model together with the spatial reasoning and planning capabilities of Vision-Language Models (VLMs). To enable open-vocabulary scene composition, we provide the VLMs with a comprehensive set of atomic operations (e.g., Scale, Rotate, FocusOn). At each interaction step, the VLM receives rendered visual feedback and acts accordingly, iteratively refining the scene to achieve more coherent spatial arrangements and better alignment with the input text. Experimental results show that our method can generate diverse, open-vocabulary, and high-quality 3D scenes, and both qualitative analysis and quantitative human evaluations demonstrate its superiority over existing methods. Furthermore, our method allows users to instruct the agent to edit existing scenes with natural language commands. Our code is available at https://github.com/ROUJINN/SceneAssistant.
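
The render-and-refine loop described in the abstract could look roughly like the sketch below, which builds on the Scene sketch above. The renderer and vlm callables and the action dictionary format are assumptions for illustration; the actual implementation in the linked repository may differ.

    def refine_scene(scene, prompt, vlm, renderer, max_steps=10):
        # Iterative render-and-refine loop (hypothetical function and key names).
        history = []
        for _ in range(max_steps):
            image = renderer(scene)                     # rendered visual feedback for this step
            action = vlm(prompt=prompt, image=image, history=history)
            if action.get("op") == "done":              # VLM judges the scene matches the prompt
                break
            operation = getattr(scene, action["op"])    # dispatch to an atomic operation
            operation(**{k: v for k, v in action.items() if k != "op"})
            history.append(action)
        return scene

The same loop can serve editing: starting from an existing scene and a user instruction as the prompt, the agent applies atomic operations until the rendered views satisfy the instruction.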
Problem

Research questions and friction points this paper is trying to address.

Text-to-3D
open-vocabulary
3D scene generation
spatial relationships
natural language
Innovation

Methods, ideas, or system contributions that make the work stand out.

open-vocabulary 3D generation
visual feedback agent
vision-language models
atomic operations
iterative scene refinement