🤖 AI Summary
This work addresses the problem of generating high-fidelity fly-through videos from a single image, conditioned on a prescribed 3D camera trajectory—where the core challenge lies in maintaining geometric consistency and detail fidelity under dynamic viewpoint changes. To this end, we propose a quadruple 3D-camera-aware conditioning mechanism: (i) explicit extrinsic parameter injection, (ii) camera-ray-based image encoding, (iii) inter-frame reprojection guidance, and (iv) a 2D↔3D cross-dimensional Transformer. These are integrated into a latent diffusion framework equipped with a ControlNet-style multi-condition fusion architecture, spatiotemporal U-Net backbone, and scale-normalized training. We further introduce a novel evaluation metric jointly assessing video quality and view consistency. Our method achieves state-of-the-art performance on single-image scene exploration, significantly improving depth consistency, structural coherence, and texture-detail stability across frames.
📝 Abstract
We propose a method for generating fly-through videos of a scene from a single image and a given camera trajectory. We build upon an image-to-video latent diffusion model and condition its UNet denoiser on the camera trajectory using four techniques. (1) We condition the UNet's temporal blocks on raw camera extrinsics, similar to MotionCtrl. (2) We use images containing camera rays and directions, similar to CameraCtrl. (3) We reproject the initial image to subsequent frames and use the resulting video as a condition. (4) We use 2D↔3D transformers to introduce a global 3D representation, which implicitly conditions on the camera poses. We combine all conditions in a ControlNet-style architecture. We then propose a metric that evaluates overall video quality and the ability to preserve details with view changes, which we use to analyze the trade-offs of individual and combined conditions. Finally, we identify an optimal combination of conditions. We calibrate camera positions in our datasets for scale consistency across scenes, and we train our scene exploration model, CamCtrl3D, demonstrating state-of-the-art results.
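As a concrete illustration of condition (2), the camera-ray images in the style of CameraCtrl can be sketched as per-pixel ray origins and directions derived from the camera intrinsics and extrinsics. The helper below is a minimal NumPy sketch under standard pinhole-camera assumptions; the function name, channel layout, and formulation are illustrative, not the paper's actual implementation.

```python
import numpy as np

def camera_ray_image(K, R, t, H, W):
    """Build an (H, W, 6) ray image: channels 0-2 are the ray origin
    (camera center in world coordinates), channels 3-5 the unit ray
    direction per pixel. Hypothetical helper illustrating ray-based
    conditioning; K is the 3x3 intrinsics matrix, (R, t) the
    world-to-camera extrinsics."""
    # Pixel grid sampled at pixel centers.
    u, v = np.meshgrid(np.arange(W) + 0.5, np.arange(H) + 0.5)
    pix = np.stack([u, v, np.ones_like(u)], axis=-1)          # (H, W, 3)
    # Back-project pixels to camera-space directions, rotate to world space.
    dirs_cam = pix @ np.linalg.inv(K).T                       # (H, W, 3)
    dirs_world = dirs_cam @ R                                 # row-wise R^T @ d
    dirs_world /= np.linalg.norm(dirs_world, axis=-1, keepdims=True)
    # Camera center in world coordinates: C = -R^T t, broadcast to all pixels.
    origins = np.broadcast_to(-R.T @ t, (H, W, 3))
    return np.concatenate([origins, dirs_world], axis=-1)
```

Such a 6-channel image (or a Plücker-coordinate variant) can be fed to the denoiser alongside the input frame, giving the network an explicit per-pixel encoding of the camera pose.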