LEGO-Eval: Towards Fine-Grained Evaluation on Synthesizing 3D Embodied Environments with Tool Augmentation

📅 2025-11-04
🤖 AI Summary
Current 3D scene generation methods are conditioned on coarse-grained instructions, which leads to distorted spatial layouts and inaccurate object attributes, and there is no reliable protocol for evaluating scene-instruction consistency. To address this, we propose LEGO-Eval, a multimodal evaluation framework, and LEGO-Bench, a dedicated benchmark. LEGO-Eval combines large language models (LLMs), vision-language models (VLMs), and domain-specific tools to explicitly ground scene components and verify scene structure, layout, and attributes at a fine-grained level; to our knowledge it is the first multi-tool collaborative system for fine-grained instruction-to-scene alignment assessment. Experiments on LEGO-Bench show that LEGO-Eval outperforms VLM-as-a-judge baselines by 0.41 F1 in assessing alignment, while benchmarking reveals a fundamental limitation of current generation methods: at most 10% of generated scenes fully align with fine-grained instructions. This establishes a more rigorous foundation for evaluating, and ultimately advancing, high-fidelity 3D scene generation for training embodied agents.

📝 Abstract
Despite recent progress in using Large Language Models (LLMs) for automatically generating 3D scenes, generated scenes often lack realistic spatial layouts and object attributes found in real-world environments. As this problem stems from insufficiently detailed, coarse-grained instructions, advancing 3D scene synthesis guided by more detailed, fine-grained instructions that reflect real-world environments becomes crucial. Without such realistic scenes, training embodied agents in unrealistic environments can lead them to learn priors that diverge significantly from real-world physics and semantics, degrading their performance when deployed. Thus, verifying the alignment between the fine-grained instruction and the generated scene is essential for effective learning. However, current evaluation methods, such as CLIPScore and vision-language models (VLMs), often fail to reliably assess such alignment. This shortcoming arises primarily from their shallow understanding of 3D scenes, which often leads to improperly grounded scene components. To address this, we introduce LEGO-Eval, an evaluation framework equipped with diverse tools designed to explicitly ground scene components, enabling more accurate alignment assessments. We also present LEGO-Bench, a benchmark of detailed instructions that specify complex layouts and attributes of real-world environments. Experiments demonstrate that LEGO-Eval outperforms VLM-as-a-judge by 0.41 F1 score in assessing scene-instruction alignment. Benchmarking with LEGO-Bench reveals significant limitations in current generation methods. Across all evaluated approaches, success rates reached at most 10% in generating scenes that fully align with fine-grained instructions.
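The strict success metric described above (a scene counts only if it fully aligns with every part of the fine-grained instruction) can be sketched as follows. This is a hypothetical illustration, not the paper's code: the `Check`, `tool`, and `scene_success_rate` names are invented, and the per-check tool results stand in for LEGO-Eval's actual grounding tools.

```python
from dataclasses import dataclass

# Hypothetical sketch: a fine-grained instruction is decomposed into atomic
# checks, each verified by a specialized tool (layout, attribute, ...), and a
# scene "fully aligns" only if every check passes. Names are illustrative.

@dataclass
class Check:
    tool: str     # which tool verified this check, e.g. "layout" or "attribute"
    passed: bool  # result of running the tool on the grounded scene component

def scene_success_rate(scenes: list[list[Check]]) -> float:
    """Fraction of scenes whose checks all pass (the strict 'fully align' metric)."""
    if not scenes:
        return 0.0
    fully_aligned = sum(all(c.passed for c in scene) for scene in scenes)
    return fully_aligned / len(scenes)

# Toy example: only 1 of 3 scenes satisfies every fine-grained check.
scenes = [
    [Check("layout", True), Check("attribute", True)],
    [Check("layout", True), Check("attribute", False)],
    [Check("layout", False), Check("attribute", True)],
]
print(round(scene_success_rate(scenes), 2))  # 0.33
```

Under such an all-checks-must-pass metric, a single wrong attribute or misplaced object fails the whole scene, which is why reported success rates stay at or below 10% even for methods that look reasonable under coarser scoring.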
Problem

Research questions and friction points this paper is trying to address.

Evaluating alignment between fine-grained instructions and generated 3D scenes
Addressing unrealistic spatial layouts and object attributes in synthesized environments
Overcoming shallow 3D scene understanding in current evaluation methods
Innovation

Methods, ideas, or system contributions that make the work stand out.

Evaluation framework with tool augmentation for 3D scenes
Explicit grounding of scene components for alignment assessment
Benchmark with detailed instructions for real-world environments