IDCNet: Guided Video Diffusion for Metric-Consistent RGBD Scene Generation with Precise Camera Control

📅 2025-08-06
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses two key challenges in RGB-D video generation: imprecise camera trajectory control and geometric inconsistency between RGB and depth frames. To this end, we propose the Image-Depth Consistency Network (IDC-Net). Methodologically, IDC-Net introduces a geometry-aware diffusion model incorporating a novel geometry-aware Transformer module to jointly model RGB and depth data in spatiotemporal domains; it explicitly conditions generation on camera poses and is trained via metric alignment using a high-fidelity, precisely aligned camera–image–depth dataset. Experiments demonstrate that IDC-Net significantly outperforms existing methods in both visual quality and geometric fidelity. The generated RGB-D videos exhibit inter-frame metric consistency and fine-grained, pose-controllable camera motion. Crucially, outputs are directly usable for downstream 3D reconstruction tasks without post-processing, substantially enhancing practicality and system compatibility.
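The summary above describes a geometry-aware transformer that jointly models RGB and depth in the spatiotemporal domain while conditioning on camera poses. The paper's architectural details are not given here, so the PyTorch sketch below only illustrates one common way such a block could be built: joint self-attention over concatenated RGB and depth tokens, modulated AdaLN-style by a pose embedding. GeometryAwareBlock, pose_dim, and the modulation scheme are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn

class GeometryAwareBlock(nn.Module):
    """Hypothetical sketch: joint self-attention over RGB and depth tokens,
    modulated by a per-frame camera-pose embedding (AdaLN-style).
    Not the authors' implementation."""

    def __init__(self, dim: int = 512, heads: int = 8, pose_dim: int = 16):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim, elementwise_affine=False)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim, elementwise_affine=False)
        self.mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )
        # Map a flattened camera pose (e.g., 4x4 extrinsics -> 16 values)
        # to scale/shift modulation parameters for both sub-layers.
        self.pose_mod = nn.Linear(pose_dim, 4 * dim)

    def forward(self, rgb_tok, depth_tok, pose):
        # rgb_tok, depth_tok: (B, N, dim) spatiotemporal tokens
        # pose:               (B, pose_dim) flattened camera extrinsics
        x = torch.cat([rgb_tok, depth_tok], dim=1)            # joint RGB-D sequence
        s1, b1, s2, b2 = self.pose_mod(pose).unsqueeze(1).chunk(4, dim=-1)
        h = self.norm1(x) * (1 + s1) + b1                     # pose-conditioned norm
        x = x + self.attn(h, h, h, need_weights=False)[0]     # joint self-attention
        h = self.norm2(x) * (1 + s2) + b2
        x = x + self.mlp(h)
        n = rgb_tok.shape[1]
        return x[:, :n], x[:, n:]                             # split back into RGB / depth
```

Stacking blocks of this kind inside a video diffusion backbone would let the camera poses steer appearance and geometry in a single pass, which is the property the summary attributes to IDC-Net.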

📝 Abstract
We present IDC-Net (Image-Depth Consistency Network), a novel framework designed to generate RGB-D video sequences under explicit camera trajectory control. Unlike approaches that treat RGB and depth generation separately, IDC-Net jointly synthesizes both RGB images and corresponding depth maps within a unified geometry-aware diffusion model. The joint learning framework strengthens spatial and geometric alignment across frames, enabling more precise camera control in the generated sequences. To support the training of this camera-conditioned model and ensure high geometric fidelity, we construct a camera-image-depth consistent dataset with metric-aligned RGB videos, depth maps, and accurate camera poses, which provides precise geometric supervision with notably improved inter-frame geometric consistency. Moreover, we introduce a geometry-aware transformer block that enables fine-grained camera control over the generated sequences. Extensive experiments show that IDC-Net achieves improvements over state-of-the-art approaches in both visual quality and geometric consistency of generated scene sequences. Notably, the generated RGB-D sequences can be directly fed into downstream 3D scene reconstruction tasks without extra post-processing steps, showcasing the practical benefits of our joint learning framework. See more at https://idcnet-scene.github.io.
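The claim that generated RGB-D sequences feed directly into 3D reconstruction rests on the depth being metric and consistent with the supplied camera poses: each frame can then be back-projected into world space and fused without scale fitting or other post-processing. Below is a minimal NumPy sketch of that back-projection, assuming a pinhole camera with intrinsics K and camera-to-world extrinsics; the function and variable names are hypothetical, not taken from the paper.

```python
import numpy as np

def backproject_frame(rgb, depth, K, cam_to_world):
    """Lift one RGB-D frame into world-space points (sketch, pinhole model).

    rgb:          (H, W, 3) uint8 image
    depth:        (H, W) metric depth in meters
    K:            (3, 3) camera intrinsics
    cam_to_world: (4, 4) camera-to-world extrinsics
    """
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    # Pixel rays scaled by metric depth -> camera-space points.
    z = depth.reshape(-1)
    x = (u.reshape(-1) - K[0, 2]) * z / K[0, 0]
    y = (v.reshape(-1) - K[1, 2]) * z / K[1, 1]
    pts_cam = np.stack([x, y, z, np.ones_like(z)], axis=0)    # (4, H*W)
    pts_world = (cam_to_world @ pts_cam)[:3].T                # (H*W, 3)
    valid = z > 0
    return pts_world[valid], rgb.reshape(-1, 3)[valid]

# Fusing every generated frame yields a colored point cloud of the scene:
# points, colors = zip(*(backproject_frame(r, d, K, T) for r, d, T in frames))
# cloud = np.concatenate(points); cloud_colors = np.concatenate(colors)
```

If the generated depth were only relative or inconsistent with the poses, this fusion would require per-frame scale estimation first, which is exactly the post-processing the abstract says is unnecessary.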
Problem

Research questions and friction points this paper is trying to address.

Generating RGB-D videos under precisely controlled camera trajectories
Ensuring spatial and geometric alignment across generated frames
Providing camera control accurate enough for downstream 3D scene reconstruction
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unified geometry-aware diffusion model for joint RGB-D generation
Camera-image-depth consistent dataset for training (see the metric-alignment sketch after this list)
Geometry-aware transformer block for precise camera control
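The summary does not spell out how the metric-aligned camera-image-depth dataset is built. A common recipe, sketched below purely as an assumption, is to fit a per-frame scale and shift between an estimated depth map and sparse metric depth (for example from SfM) by least squares, so that depth maps, images, and camera poses share one metric frame. Function and variable names are hypothetical.

```python
import numpy as np

def align_depth_to_metric(est_depth, sparse_metric_depth, mask):
    """Fit scale s and shift t so that s * est_depth + t best matches the
    sparse metric depth at valid pixels (least squares). Sketch of an
    assumed alignment step; not the authors' pipeline."""
    d = est_depth[mask].reshape(-1)
    m = sparse_metric_depth[mask].reshape(-1)
    A = np.stack([d, np.ones_like(d)], axis=1)   # (N, 2) design matrix
    (s, t), *_ = np.linalg.lstsq(A, m, rcond=None)
    return s * est_depth + t

# Example: align a frame's predicted depth to its sparse SfM depths.
# metric_depth = align_depth_to_metric(pred_depth, sfm_depth, sfm_depth > 0)
```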
Lijuan Liu
Bytedance Inc.
Wenfa Li
Bytedance Inc.
Dongbo Zhang
Bytedance Inc.
Shuo Wang
Bytedance Inc.
Shaohui Jiao
Unknown affiliation