GEN3C: 3D-Informed World-Consistent Video Generation with Precise Camera Control

📅 2025-03-05
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing video generation models encode weak 3D priors, leading to temporal object inconsistencies (e.g., flickering or popping) and coarse camera control that relies on implicit pose estimation rather than explicit geometric constraints. GEN3C addresses these limitations with a generative framework built around an updateable 3D point-cloud cache, explicitly coupling user-specified camera trajectories with scene geometry to synthesize temporally coherent, 3D-consistent videos in world coordinates. Conditioning on renderings of this 3D cache decouples camera control from content generation: the model needs neither frame-wise memory nor implicit pose modeling, which enables precise extrinsic control and dynamic scene evolution. The method integrates depth prediction, sparse-view rendering, and 3D-conditioned diffusion modeling. Evaluated in challenging settings, including autonomous driving scenes and monocular dynamic video, GEN3C achieves state-of-the-art performance in sparse-view novel-view synthesis and delivers the most accurate user-controllable camera-motion video generation to date.
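The cache described above is built by lifting pixel-wise depth predictions into a world-space point cloud. A minimal numpy sketch of that unprojection step, assuming a simple pinhole camera model and a hypothetical `unproject_depth` helper (this is an illustration, not the paper's implementation):

```python
import numpy as np

def unproject_depth(depth, K, cam_to_world):
    """Lift a pixel-wise depth map into a world-space point cloud.

    depth: (H, W) depths along the camera z-axis.
    K: (3, 3) pinhole intrinsics.
    cam_to_world: (4, 4) camera-to-world extrinsics.
    Returns an (H*W, 3) array of points in world coordinates.
    """
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    # Homogeneous pixel coordinates (u, v, 1) for every pixel.
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3)
    rays = pix @ np.linalg.inv(K).T            # camera-space rays at depth 1
    pts_cam = rays * depth.reshape(-1, 1)      # scale each ray by its depth
    pts_h = np.concatenate([pts_cam, np.ones((pts_cam.shape[0], 1))], axis=1)
    return (pts_h @ cam_to_world.T)[:, :3]     # move into world coordinates
```

Because the points live in world coordinates, point clouds from multiple seed images or previously generated frames can be merged into one cache and re-rendered under any new camera pose.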

📝 Abstract
We present GEN3C, a generative video model with precise Camera Control and temporal 3D Consistency. Prior video models already generate realistic videos, but they tend to leverage little 3D information, leading to inconsistencies, such as objects popping in and out of existence. Camera control, if implemented at all, is imprecise, because camera parameters are mere inputs to the neural network which must then infer how the video depends on the camera. In contrast, GEN3C is guided by a 3D cache: point clouds obtained by predicting the pixel-wise depth of seed images or previously generated frames. When generating the next frames, GEN3C is conditioned on the 2D renderings of the 3D cache with the new camera trajectory provided by the user. Crucially, this means that GEN3C neither has to remember what it previously generated nor does it have to infer the image structure from the camera pose. The model, instead, can focus all its generative power on previously unobserved regions, as well as advancing the scene state to the next frame. Our results demonstrate more precise camera control than prior work, as well as state-of-the-art results in sparse-view novel view synthesis, even in challenging settings such as driving scenes and monocular dynamic video. Results are best viewed in videos. Check out our webpage! https://research.nvidia.com/labs/toronto-ai/GEN3C/
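The conditioning signal the abstract describes, 2D renderings of the 3D cache under the user's camera trajectory, can be approximated by z-buffered point splatting: pixels left uncovered mark exactly the "previously unobserved regions" the generator must fill. A hedged sketch with a hypothetical `render_points` helper and a pinhole camera assumed (the paper's actual renderer differs):

```python
import numpy as np

def render_points(points, colors, K, world_to_cam, H, W):
    """Splat a colored world-space point cloud into an (H, W) image.

    The nearest point wins each pixel via a z-buffer; pixels no point
    covers stay black, flagging regions for the generator to inpaint.
    """
    pts_h = np.concatenate([points, np.ones((len(points), 1))], axis=1)
    cam = (pts_h @ world_to_cam.T)[:, :3]
    in_front = cam[:, 2] > 1e-6               # drop points behind the camera
    cam, cols = cam[in_front], colors[in_front]
    proj = cam @ K.T
    uv = np.round(proj[:, :2] / proj[:, 2:3]).astype(int)
    valid = (uv[:, 0] >= 0) & (uv[:, 0] < W) & (uv[:, 1] >= 0) & (uv[:, 1] < H)
    uv, cam, cols = uv[valid], cam[valid], cols[valid]
    image = np.zeros((H, W, 3))
    zbuf = np.full((H, W), np.inf)
    # Per-point loop for clarity; a real renderer would batch or rasterize.
    for (u, v), z, c in zip(uv, cam[:, 2], cols):
        if z < zbuf[v, u]:
            zbuf[v, u] = z
            image[v, u] = c
    return image
```

Conditioning on such renderings is what lets the model skip remembering past frames or inferring structure from raw camera parameters.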
Problem

Research questions and friction points this paper is trying to address.

Improves 3D consistency in video generation
Enables precise user-controlled camera trajectories
Addresses object inconsistencies in dynamic scenes
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses 3D cache for consistent video generation
Precise camera control via user-defined trajectories
Focuses generative power on unobserved scene regions