PhyMix: Towards Physically Consistent Single-Image 3D Indoor Scene Generation with Implicit-Explicit Optimization

📅 2026-04-11

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing single-image 3D indoor scene generation methods often disregard physical laws, limiting their reliability for applications such as robotics and embodied intelligence. To measure this gap, the work introduces a Physics Evaluator, described as the first benchmark for physical consistency, which covers four aspects (geometric priors, contact, stability, and deployability) decomposed into nine sub-constraints. Building on this evaluator, the proposed PhyMix framework combines Scene-GRPO, a critic-free group-relative policy optimization algorithm that uses the evaluator as a preference signal for implicit alignment during training, with a plug-and-play Test-Time Optimizer (TTO) that uses differentiable evaluator signals for explicit correction at inference. The authors report state-of-the-art visual fidelity and physical plausibility in synthetic evaluations, with qualitative results on stylized and real-world images.
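As a rough illustration of what "critic-free" and "group-relative" usually mean in GRPO-style training, the sketch below standardizes rewards within one group of samples so that no learned value network is needed. It is not the paper's implementation: the function name `group_relative_advantages`, the `eps` constant, and the example scores are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one sampled group.

    Each sample is scored relative to the group mean and spread,
    which replaces the baseline a learned critic would provide.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps (assumed) avoids division by zero when all rewards are equal.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical physics scores for four layouts sampled from the same image.
scores = [0.2, 0.9, 0.5, 0.4]
advantages = group_relative_advantages(scores)

for score, adv in zip(scores, advantages):
    print(f"score={score:.2f} advantage={adv:+.3f}")
```

Layouts that score above the group mean get positive advantages and would be reinforced; those below the mean get negative advantages.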

📝 Abstract

Existing single-image 3D indoor scene generators often produce results that look visually plausible but fail to obey real-world physics, limiting their reliability in robotics, embodied AI, and design. To examine this gap, we introduce a unified Physics Evaluator that measures four main aspects: geometric priors, contact, stability, and deployability, which are further decomposed into nine sub-constraints, establishing the first benchmark to measure physical consistency. Based on this evaluator, our analysis shows that state-of-the-art methods remain largely physics-unaware. To overcome this limitation, we further propose a framework that integrates feedback from the Physics Evaluator into both training and inference, enhancing the physical plausibility of generated scenes. Specifically, we propose PhyMix, which is composed of two complementary components: (i) implicit alignment via Scene-GRPO, a critic-free group-relative policy optimization that leverages the Physics Evaluator as a preference signal and biases sampling towards physically feasible layouts, and (ii) explicit refinement via a plug-and-play Test-Time Optimizer (TTO) that uses differentiable evaluator signals to correct residual violations during generation. Overall, our method unifies evaluation, reward shaping, and inference-time correction, producing 3D indoor scenes that are visually faithful and physically plausible. Extensive synthetic evaluations confirm state-of-the-art performance in both visual fidelity and physical plausibility, and qualitative examples on stylized and real-world images further showcase the robustness of the method. We will release code and models upon publication.
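The idea of correcting residual violations with a differentiable signal can be sketched with a toy example. The code below is not the paper's TTO: it assumes objects are circles on a floor plane, uses a hand-derived gradient of a squared-overlap penalty, and adds an assumed `anchor` term that keeps objects near their initially predicted positions. The names `overlap_penalty` and `refine`, and all parameter values, are hypothetical.

```python
import math


def overlap_penalty(positions, radii):
    """Sum of squared pairwise overlaps between circles (0 if none touch)."""
    total = 0.0
    for i in range(len(positions)):
        for j in range(i + 1, len(positions)):
            d = math.dist(positions[i], positions[j])
            total += max(0.0, radii[i] + radii[j] - d) ** 2
    return total


def refine(positions, radii, steps=200, lr=0.1, anchor=0.01):
    """Gradient descent on overlap penalty plus a pull toward the initial layout."""
    init = [tuple(p) for p in positions]
    pos = [list(p) for p in positions]
    n = len(pos)
    for _ in range(steps):
        grad = [[0.0, 0.0] for _ in range(n)]
        for i in range(n):
            for j in range(i + 1, n):
                dx = pos[i][0] - pos[j][0]
                dy = pos[i][1] - pos[j][1]
                d = math.hypot(dx, dy)
                overlap = radii[i] + radii[j] - d
                # Coincident centers have no defined direction; skip them.
                if overlap > 0 and d > 1e-9:
                    g = -2.0 * overlap / d  # d(overlap^2)/d(pos_i) = g * (dx, dy)
                    grad[i][0] += g * dx
                    grad[i][1] += g * dy
                    grad[j][0] -= g * dx
                    grad[j][1] -= g * dy
        for i in range(n):
            # Anchor term (assumed): stay close to the predicted layout.
            grad[i][0] += 2.0 * anchor * (pos[i][0] - init[i][0])
            grad[i][1] += 2.0 * anchor * (pos[i][1] - init[i][1])
            pos[i][0] -= lr * grad[i][0]
            pos[i][1] -= lr * grad[i][1]
    return [tuple(p) for p in pos]


# Two hypothetical objects of radius 0.5 whose footprints interpenetrate.
initial = [(0.0, 0.0), (0.6, 0.0)]
radii = [0.5, 0.5]
refined = refine(initial, radii)
print(overlap_penalty(initial, radii), overlap_penalty(refined, radii))
```

The anchor weight trades off physical correction against fidelity to the initial prediction: a larger value keeps objects closer to where the generator placed them but leaves more residual overlap.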

Problem

Research questions and friction points this paper is trying to address.

physical consistency

single-image 3D generation

indoor scene

physics-aware generation

3D scene plausibility

Innovation

Methods, ideas, or system contributions that make the work stand out.

physical consistency

implicit-explicit optimization

Physics Evaluator