How Far Are Vision-Language Models from Constructing the Real World? A Benchmark for Physical Generative Reasoning

📅 2026-03-25

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses a critical gap in the evaluation of vision-language models, which has predominantly emphasized visual plausibility while neglecting physical reasoning capabilities in generative tasks. To bridge this gap, we introduce DreamHouse, a novel benchmark centered on residential timber-frame construction. Built upon over 26,000 BIM samples conforming to LOD 350 standards and validated through ten deterministic structural tests, DreamHouse establishes a dynamic evaluation framework that supports agent-environment interaction with iterative feedback. For the first time, constructability, structural compliance, and geometric constraints are formally integrated into a multimodal generative assessment paradigm through a multi-dimensional formulation of physical constraints. Experimental results reveal significant blind spots in mainstream models regarding physical validity—a dimension orthogonal to conventional visual realism—highlighting the urgent need to incorporate such criteria into generative model evaluation protocols.

📝 Abstract

The physical world is not merely visual; it is governed by rigorous structural and procedural constraints. Yet the evaluation of vision-language models (VLMs) remains heavily skewed toward perceptual realism, prioritizing the generation of visually plausible 3D layouts, shapes, and appearances. Current benchmarks rarely test whether models grasp the step-by-step processes and physical dependencies required to actually build these artifacts, a capability essential for automating design-to-construction pipelines. To address this, we introduce DreamHouse, a novel benchmark for physical generative reasoning: the capacity to synthesize artifacts that concurrently satisfy geometric, structural, constructability, and code-compliance constraints. We ground this benchmark in residential timber-frame construction, a domain with fully codified engineering standards and objectively verifiable correctness. We curate over 26,000 structures spanning 13 architectural styles, each verified to construction-document standards (LOD 350), and develop a deterministic 10-test structural validation framework. Unlike static benchmarks that assess only final outputs, DreamHouse supports iterative agentic interaction. Models observe intermediate build states, generate construction actions, and receive structured environmental feedback, enabling a fine-grained evaluation of planning, structural reasoning, and self-correction. Extensive experiments with state-of-the-art VLMs reveal substantial capability gaps that are largely invisible on existing leaderboards. These findings establish physical validity as a critical evaluation axis orthogonal to visual realism, highlighting physical generative reasoning as a distinct and underdeveloped frontier in multimodal intelligence. Available at https://luluyuyuyang.github.io/dreamhouse
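
The observe, act, and receive-feedback loop described in the abstract can be illustrated with a toy sketch. This is not the DreamHouse API or its 10-test validator: every name here (`Member`, `Feedback`, `step`, `run_episode`, `scripted_policy`) and both checks are hypothetical stand-ins, chosen only to show how deterministic checks can return structured feedback that an agent uses to correct itself.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional

# Hypothetical data model: NOT the benchmark's actual schema.
@dataclass(frozen=True)
class Member:
    name: str
    kind: str                              # e.g. "plate", "stud"
    supported_by: tuple[str, ...] = ()     # names of members it rests on

@dataclass
class Feedback:
    accepted: bool
    messages: list[str] = field(default_factory=list)

State = dict[str, Member]

# Two toy deterministic checks (illustrative only; the paper's
# ten structural tests are not reproduced here).
def check_unique_name(state: State, m: Member) -> Optional[str]:
    return f"duplicate member: {m.name}" if m.name in state else None

def check_supports_exist(state: State, m: Member) -> Optional[str]:
    missing = [s for s in m.supported_by if s not in state]
    return f"{m.name} rests on unplaced members: {missing}" if missing else None

CHECKS = [check_unique_name, check_supports_exist]

def step(state: State, action: Member) -> Feedback:
    """Apply one construction action; reject it if any check fails."""
    messages = []
    for check in CHECKS:
        msg = check(state, action)
        if msg is not None:
            messages.append(msg)
    if messages:
        return Feedback(False, messages)
    state[action.name] = action
    return Feedback(True)

Policy = Callable[[State, Optional[Feedback]], Optional[Member]]

def run_episode(policy: Policy, max_steps: int = 10):
    """Loop: the policy observes state and last feedback, then acts."""
    state: State = {}
    last: Optional[Feedback] = None
    log: list[Feedback] = []
    for _ in range(max_steps):
        action = policy(state, last)
        if action is None:        # policy signals it is done
            break
        last = step(state, action)
        log.append(last)
    return state, log

def scripted_policy(state: State, last: Optional[Feedback]) -> Optional[Member]:
    """Stand-in for a VLM: makes one ordering mistake, then recovers."""
    plan = [Member("sill_plate", "plate"),
            Member("stud_1", "stud", ("sill_plate",))]
    if not state and last is None:
        return plan[1]            # deliberate error: stud before its plate
    for m in plan:
        if m.name not in state:
            return m
    return None

state, log = run_episode(scripted_policy)
print([f.accepted for f in log])  # → [False, True, True]
```

In this sketch the first action is rejected because the stud's support has not been placed, and the scripted policy then places the plate before retrying. A real evaluation would replace `scripted_policy` with a model call and the toy checks with the benchmark's own validators.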

Problem

Research questions and friction points this paper is trying to address.

physical generative reasoning

vision-language models

constructability

benchmark

structural constraints

Innovation

Methods, ideas, or system contributions that make the work stand out.

physical generative reasoning

constructability

vision-language models