🤖 AI Summary
This work addresses two key challenges in RGB-D video generation: imprecise camera trajectory control and geometric inconsistency between RGB and depth frames. To this end, we propose the Image-Depth Consistency Network (IDC-Net). Methodologically, IDC-Net introduces a geometry-aware diffusion model incorporating a novel geometry-aware Transformer module to jointly model RGB and depth data in spatiotemporal domains; it explicitly conditions generation on camera poses and is trained via metric alignment using a high-fidelity, precisely aligned camera–image–depth dataset. Experiments demonstrate that IDC-Net significantly outperforms existing methods in both visual quality and geometric fidelity. The generated RGB-D videos exhibit inter-frame metric consistency and fine-grained, pose-controllable camera motion. Crucially, outputs are directly usable for downstream 3D reconstruction tasks without post-processing, substantially enhancing practicality and system compatibility.
📝 Abstract
We present IDC-Net (Image-Depth Consistency Network), a novel framework designed to generate RGB-D video sequences under explicit camera trajectory control. Unlike approaches that treat RGB and depth generation separately, IDC-Net jointly synthesizes both RGB images and corresponding depth maps within a unified geometry-aware diffusion model. The joint learning framework strengthens spatial and geometric alignment across frames, enabling more precise camera control in the generated sequences. To support the training of this camera-conditioned model and ensure high geometric fidelity, we construct a camera-image-depth consistent dataset with metric-aligned RGB videos, depth maps, and accurate camera poses, which provides precise geometric supervision with notably improved inter-frame geometric consistency. Moreover, we introduce a geometry-aware transformer block that enables fine-grained camera control, enhancing control over the generated sequences. Extensive experiments show that IDC-Net achieves improvements over state-of-the-art approaches in both visual quality and geometric consistency of generated scene sequences. Notably, the generated RGB-D sequences can be directly feed for downstream 3D Scene reconstruction tasks without extra post-processing steps, showcasing the practical benefits of our joint learning framework. See more at https://idcnet-scene.github.io.