IDCNet: Guided Video Diffusion for Metric-Consistent RGBD Scene Generation with Precise Camera Control

📅 2025-08-06

📈 Citations: 0

✨ Influential: 0

career value

213K/year

🤖 AI Summary

This work addresses two key challenges in RGB-D video generation: imprecise camera trajectory control and geometric inconsistency between RGB and depth frames. To this end, we propose the Image-Depth Consistency Network (IDC-Net). Methodologically, IDC-Net introduces a geometry-aware diffusion model incorporating a novel geometry-aware Transformer module to jointly model RGB and depth data in spatiotemporal domains; it explicitly conditions generation on camera poses and is trained via metric alignment using a high-fidelity, precisely aligned camera–image–depth dataset. Experiments demonstrate that IDC-Net significantly outperforms existing methods in both visual quality and geometric fidelity. The generated RGB-D videos exhibit inter-frame metric consistency and fine-grained, pose-controllable camera motion. Crucially, outputs are directly usable for downstream 3D reconstruction tasks without post-processing, substantially enhancing practicality and system compatibility.

Technology Category

Application Category

📝 Abstract

We present IDC-Net (Image-Depth Consistency Network), a novel framework designed to generate RGB-D video sequences under explicit camera trajectory control. Unlike approaches that treat RGB and depth generation separately, IDC-Net jointly synthesizes both RGB images and corresponding depth maps within a unified geometry-aware diffusion model. The joint learning framework strengthens spatial and geometric alignment across frames, enabling more precise camera control in the generated sequences. To support the training of this camera-conditioned model and ensure high geometric fidelity, we construct a camera-image-depth consistent dataset with metric-aligned RGB videos, depth maps, and accurate camera poses, which provides precise geometric supervision with notably improved inter-frame geometric consistency. Moreover, we introduce a geometry-aware transformer block that enables fine-grained camera control, enhancing control over the generated sequences. Extensive experiments show that IDC-Net achieves improvements over state-of-the-art approaches in both visual quality and geometric consistency of generated scene sequences. Notably, the generated RGB-D sequences can be directly feed for downstream 3D Scene reconstruction tasks without extra post-processing steps, showcasing the practical benefits of our joint learning framework. See more at https://idcnet-scene.github.io.

Problem

Research questions and friction points this paper is trying to address.

Generates RGB-D videos with controlled camera trajectories

Ensures spatial and geometric alignment in generated sequences

Provides precise camera control for 3D scene reconstruction

Innovation

Methods, ideas, or system contributions that make the work stand out.

Unified geometry-aware diffusion model for RGB-D

Camera-image-depth consistent dataset for training

Geometry-aware transformer block for precise control

🔎 Similar Papers

No similar papers found.