🤖 AI Summary
Addressing key obstacles in AR and embodied intelligence (multi-view illumination inconsistency, shadow distortion, and the poor scalability of inverse rendering for object compositing), this paper proposes a two-stage feed-forward compositing framework. Stage one achieves geometric-semantic alignment between 2D images and 3D Gaussian scenes via a Hilbert curve mapping; stage two bypasses iterative diffusion and directly predicts illumination and shadows for efficient, photorealistic compositing. Key contributions include: (1) the first large-scale benchmark dataset designed specifically for 3D object compositing; (2) the first lightweight framework integrating Hilbert-space mapping with feed-forward inverse rendering; and (3) state-of-the-art harmonization scores on both standard and custom benchmarks, enabling real-time inference and demonstrating strong generalization and robustness on real-world smartphone-captured scenes.
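The Hilbert curve provides a locality-preserving way to flatten a 2D grid into a 1D sequence, which is what lets image pixels and similarly ordered 3D Gaussian tokens share a common sequence layout. Below is a minimal sketch of the standard 2D Hilbert index and how it might serialize an image feature map; the paper's exact mapping is not reproduced here, so the grid size, feature shapes, and serialization step are illustrative assumptions.

```python
import numpy as np

def xy2d(n: int, x: int, y: int) -> int:
    """Hilbert-curve index of cell (x, y) on an n x n grid (n a power of two)."""
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if (x & s) else 0
        ry = 1 if (y & s) else 0
        d += s * s * ((3 * rx) ^ ry)
        if ry == 0:                      # rotate/reflect so sub-squares nest correctly
            if rx == 1:
                x, y = n - 1 - x, n - 1 - y
            x, y = y, x
        s //= 2
    return d

# Serialize an n x n x C feature map into a locality-preserving 1D token sequence:
# nearby pixels stay nearby in the sequence, so the 2D tokens can line up with a
# comparably ordered sequence of 3D Gaussian tokens (shapes here are hypothetical).
n = 16                                    # grid side, power of two
feats = np.random.rand(n, n, 32)          # stand-in for per-pixel features
coords = [(x, y) for y in range(n) for x in range(n)]
coords.sort(key=lambda p: xy2d(n, *p))
tokens = np.stack([feats[y, x] for x, y in coords])   # shape (n*n, 32)
```

Compared with row-major (raster) flattening, the Hilbert order avoids the large jumps that occur at the end of each scanline, which is why it is a common choice when a sequence model needs spatial neighbors to stay close in token order.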
📝 Abstract
Object compositing offers significant promise for augmented reality (AR) and embodied intelligence applications. Existing approaches predominantly target single-image scenarios or intrinsic decomposition techniques, and struggle with multi-view consistency, complex scenes, and diverse lighting conditions. Recent advances in inverse rendering, such as 3D Gaussian- and diffusion-based methods, have improved consistency but remain limited by scalability, heavy data requirements, or long per-scene reconstruction times. To broaden applicability, we introduce MV-CoLight, a two-stage framework for illumination-consistent object compositing in both 2D images and 3D scenes. Our novel feed-forward architecture models lighting and shadows directly, avoiding the iterative biases of diffusion-based methods. We employ a Hilbert curve-based mapping to align 2D image inputs seamlessly with 3D Gaussian scene representations. To facilitate training and evaluation, we further introduce a large-scale 3D compositing dataset. Experiments demonstrate state-of-the-art harmonized results on standard benchmarks and our dataset, while casually captured real-world scenes further demonstrate the framework's robustness and broad generalization.
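In contrast with a diffusion sampler, which refines its output over many denoising steps, a feed-forward design amortizes the entire prediction into a single network call. The sketch below is purely schematic (the module, its layers, and tensor shapes are hypothetical, not MV-CoLight's actual architecture) and only illustrates why one pass supports real-time inference.

```python
import torch
import torch.nn as nn

class FeedForwardRelighter(nn.Module):
    """Hypothetical one-pass predictor: scene + object features -> lit composite."""
    def __init__(self, dim: int = 256):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(6, dim, 3, padding=1), nn.ReLU(),
            nn.Conv2d(dim, dim, 3, padding=1), nn.ReLU(),
        )
        self.to_rgb = nn.Conv2d(dim, 3, 1)      # harmonized composite
        self.to_shadow = nn.Conv2d(dim, 1, 1)   # predicted shadow map

    def forward(self, scene_rgb, object_rgb):
        h = self.backbone(torch.cat([scene_rgb, object_rgb], dim=1))
        return self.to_rgb(h), self.to_shadow(h)

# One forward pass replaces the K denoising iterations a diffusion sampler would run:
model = FeedForwardRelighter()
scene = torch.rand(1, 3, 64, 64)
obj = torch.rand(1, 3, 64, 64)
composite, shadow = model(scene, obj)    # single call, real-time-friendly
```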