🤖 AI Summary
Virtual furniture synthesis faces two challenges: the lack of standardized benchmarks and the difficulty of achieving high synthesis fidelity while preserving background integrity. To address this, we propose the first parameter-sharing dual-diffusion backbone architecture specifically designed for this task, unifying feature extraction and inpainting for both reference furniture objects and real indoor backgrounds. We introduce RoomBench++, a large-scale, reproducible benchmark comprising 112K training pairs, integrating multi-source data from photorealistic rendering and video capture. Our method is compatible with U-Net and DiT backbones and enforces cross-modal feature alignment via parameter sharing. Extensive evaluation demonstrates state-of-the-art performance across quantitative metrics (e.g., FID, LPIPS), visual quality, and human preference studies. Moreover, our approach exhibits strong zero-shot generalization to unseen indoor layouts and generic scenes. The code and dataset are publicly released.
📝 Abstract
Virtual furniture synthesis, which seamlessly integrates reference objects into indoor scenes while maintaining geometric coherence and visual realism, holds substantial promise for home design and e-commerce applications. However, this field remains underexplored due to the scarcity of reproducible benchmarks and the limitations of existing image composition methods in achieving high-fidelity furniture synthesis while preserving background integrity. To overcome these challenges, we first present RoomBench++, a comprehensive and publicly available benchmark dataset tailored for this task. It consists of 112,851 training pairs and 1,832 testing pairs drawn from both real-world indoor videos and realistic home design renderings, thereby supporting robust training and evaluation under practical conditions. We then propose RoomEditor++, a versatile diffusion-based architecture featuring a parameter-sharing dual diffusion backbone, which is compatible with both U-Net and DiT architectures. This design unifies the feature extraction and inpainting processes for reference and background images. Our in-depth analysis reveals that the parameter-sharing mechanism enforces aligned feature representations, facilitating precise geometric transformations, texture preservation, and seamless integration. Extensive experiments validate that RoomEditor++ outperforms state-of-the-art approaches in terms of quantitative metrics, qualitative assessments, and human preference studies, while highlighting its strong generalization to unseen indoor scenes and general scenes without task-specific fine-tuning. The dataset and source code are available at https://github.com/stonecutter-21/roomeditor.
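The core idea of the parameter-sharing dual backbone can be illustrated with a minimal sketch (this is not the authors' code; all names, shapes, and the linear projection are hypothetical stand-ins for the shared diffusion backbone): a single set of weights extracts features from both the reference furniture tokens and the background tokens, so the two streams land in one aligned feature space before joint processing.

```python
import numpy as np

rng = np.random.default_rng(0)

# One shared weight matrix plays the role of the parameter-shared backbone:
# both the reference (furniture) stream and the background stream are
# projected with the *same* parameters, which is what keeps their feature
# representations aligned.
W_shared = rng.standard_normal((64, 32))

def extract_features(tokens: np.ndarray, W: np.ndarray) -> np.ndarray:
    """Project image tokens into the shared feature space."""
    return tokens @ W

ref_tokens = rng.standard_normal((16, 64))   # tokens from the reference image
bg_tokens = rng.standard_normal((256, 64))   # tokens from the background image

ref_feat = extract_features(ref_tokens, W_shared)
bg_feat = extract_features(bg_tokens, W_shared)

# Concatenating the two aligned streams lets a joint attention step (omitted
# here) move reference appearance into the masked background region.
joint = np.concatenate([bg_feat, ref_feat], axis=0)  # shape (272, 32)
```

Because the projection weights are identical for both streams, any geometric or texture cue learned for one stream is directly comparable in the other, which is the intuition behind the "aligned feature representations" claim in the abstract.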