🤖 AI Summary
Existing text-conditioned 3D indoor scene generation methods lack systematic evaluation of text–scene semantic alignment; prevailing metrics emphasize geometric fidelity while neglecting adherence to the input text. Method: We propose the first semantic consistency evaluation framework for this task, decoupling *text alignment* (explicit matching of objects and attributes) from *geometric plausibility* (implicit constraints such as collision-free layout and spatial coherence). Our approach integrates CLIP-based semantic embedding comparison, 3D spatial relation reasoning, physics-aware validation, and attribute-level matching. Contribution/Results: We design a quantifiable, decomposable, multi-dimensional metric suite and release SceneEval-100, a benchmark dataset with fine-grained annotations. Experiments reveal that state-of-the-art methods satisfy only 58% of explicit textual requirements on average. The framework provides a reproducible, standardized, and diagnostic evaluation tool to advance the field.
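The explicit-requirement side of such an evaluation can be pictured as text–object matching. Below is a minimal sketch assuming the Hugging Face `transformers` CLIP API; the `presence_score` function, the model choice, and the 0.8 threshold are illustrative assumptions for this sketch, not SceneEval's actual implementation.

```python
# Hypothetical sketch of CLIP-based object-presence scoring: embed the text
# phrase for a required object and compare it against embeddings of the
# generated scene's object labels. Names and the threshold are illustrative,
# not SceneEval's actual implementation.
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def presence_score(required_phrase: str, scene_object_labels: list[str]) -> float:
    """Best CLIP text-text cosine similarity between a required object
    phrase (e.g. 'a blue armchair') and the labels of objects in the scene."""
    texts = [required_phrase] + scene_object_labels
    inputs = processor(text=texts, return_tensors="pt", padding=True)
    with torch.no_grad():
        emb = model.get_text_features(**inputs)
    emb = emb / emb.norm(dim=-1, keepdim=True)  # unit-normalize embeddings
    sims = emb[0] @ emb[1:].T  # cosine similarity to each scene object
    return sims.max().item()

# A requirement counts as satisfied if some scene object matches above a
# (hypothetical) threshold.
satisfied = presence_score("a blue armchair", ["sofa", "armchair", "lamp"]) > 0.8
```

Scoring against labels keeps the sketch self-contained; an image-grounded variant would instead compare the phrase embedding against CLIP image features of rendered object crops.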
📄 Abstract
Despite recent advances in text-conditioned 3D indoor scene generation, there remain gaps in the evaluation of these methods. Existing metrics primarily assess the realism of generated scenes by comparing them to a set of ground-truth scenes, often overlooking alignment with the input text, a critical factor in determining how effectively a method meets user requirements. We present SceneEval, an evaluation framework designed to address this limitation. SceneEval includes metrics for both explicit user requirements, such as the presence of specific objects and their attributes described in the input text, and implicit expectations, like the absence of object collisions, providing a comprehensive assessment of scene quality. To facilitate evaluation, we introduce SceneEval-100, a dataset of scene descriptions with annotated ground-truth scene properties. We evaluate recent scene generation methods using SceneEval and demonstrate its ability to provide detailed assessments of the generated scenes, highlighting strengths and areas for improvement across multiple dimensions. Our results show that current methods struggle to generate scenes that meet user requirements, underscoring the need for further research in this direction.
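One implicit expectation named above, the absence of object collisions, can be made concrete with a short sketch. The version below uses axis-aligned bounding boxes in NumPy; a real evaluator would likely use mesh-level tests, and the helper names and tolerance here are assumptions of this sketch, not SceneEval's code.

```python
# Illustrative check for one implicit expectation: collision-free layout.
# Uses axis-aligned bounding boxes (AABBs); names and tolerance are assumed.
import numpy as np

def aabb_overlap(min_a, max_a, min_b, max_b, tol=1e-3):
    """True if two axis-aligned boxes interpenetrate by more than tol
    along every axis (touching surfaces do not count as a collision)."""
    overlap = np.minimum(max_a, max_b) - np.maximum(min_a, min_b)
    return bool(np.all(overlap > tol))

def count_collisions(boxes):
    """Count colliding object pairs; boxes is a list of (min_xyz, max_xyz)."""
    n = len(boxes)
    return sum(
        aabb_overlap(*boxes[i], *boxes[j])
        for i in range(n) for j in range(i + 1, n)
    )

boxes = [
    (np.array([0.0, 0.0, 0.0]), np.array([1.0, 1.0, 1.0])),  # e.g. a bed
    (np.array([0.9, 0.0, 0.0]), np.array([1.9, 1.0, 1.0])),  # overlapping chest
]
assert count_collisions(boxes) == 1  # the two boxes interpenetrate by 0.1 in x
```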