🤖 AI Summary
Existing single-image 3D indoor scene generation methods often disregard physical laws, which limits their reliability in applications such as robotics and embodied AI. To address this, the work introduces a Physics Evaluator that establishes the first benchmark for physical consistency, covering four aspects (geometric priors, contact, stability, and deployability) decomposed into nine sub-constraints. Built on this evaluator, the proposed PhyMix framework combines Scene-GRPO, a critic-free group-relative policy optimization algorithm, with a plug-and-play Test-Time Optimizer (TTO), giving implicit alignment during training and explicit correction at inference. By using the evaluator both as a preference signal for sampling and as a differentiable correction signal, PhyMix reports state-of-the-art visual fidelity and physical plausibility in synthetic evaluations, with qualitative results on stylized and real-world images.
📝 Abstract
Existing single-image 3D indoor scene generators often produce results that look visually plausible but fail to obey real-world physics, limiting their reliability in robotics, embodied AI, and design. To examine this gap, we introduce a unified Physics Evaluator that measures four main aspects (geometric priors, contact, stability, and deployability), which are further decomposed into nine sub-constraints, establishing the first benchmark for physical consistency. Based on this evaluator, our analysis shows that state-of-the-art methods remain largely physics-unaware. To overcome this limitation, we propose a framework that integrates feedback from the Physics Evaluator into both training and inference, enhancing the physical plausibility of generated scenes. Specifically, we propose PhyMix, which is composed of two complementary components: (i) implicit alignment via Scene-GRPO, a critic-free group-relative policy optimization method that uses the Physics Evaluator as a preference signal and biases sampling towards physically feasible layouts, and (ii) explicit refinement via a plug-and-play Test-Time Optimizer (TTO) that uses differentiable evaluator signals to correct residual violations during generation. Overall, our method unifies evaluation, reward shaping, and inference-time correction, producing 3D indoor scenes that are visually faithful and physically plausible. Extensive evaluations on synthetic data confirm state-of-the-art performance in both visual fidelity and physical plausibility, and qualitative examples on stylized and real-world images further showcase the robustness of the method. We will release code and models upon publication.
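The abstract gives no implementation details, so the following is only a minimal sketch of the two ideas it names, not the authors' method. It assumes that "critic-free group-relative" refers to the usual GRPO practice of standardizing rewards within a group of samples drawn for the same input, and it stands in for the differentiable evaluator with a toy contact penalty (the squared gap between an object's bottom and its support surface). All function names, scores, and step sizes are hypothetical.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of samples; no learned critic is used."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def contact_gap_penalty(bottom_z, support_z):
    """Toy differentiable penalty (assumption): squared gap between object and support."""
    return (bottom_z - support_z) ** 2


def refine_bottom_z(bottom_z, support_z, lr=0.25, steps=20):
    """Test-time refinement as plain gradient descent on the toy penalty."""
    for _ in range(steps):
        grad = 2.0 * (bottom_z - support_z)  # d/dz (z - s)^2
        bottom_z -= lr * grad
    return bottom_z


# Hypothetical physics scores for four candidate layouts of the same input image.
scores = [0.2, 0.4, 0.6, 0.8]
advantages = group_relative_advantages(scores)

# Hypothetical floating object: its bottom sits 0.3 units above the floor at z = 0.
refined = refine_bottom_z(bottom_z=0.3, support_z=0.0)
```

In this sketch, layouts that score above the group mean receive positive advantages and would be reinforced during training, while the refinement loop pulls the floating object down onto its support at inference. A real evaluator would combine all nine sub-constraints rather than a single contact term.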