🤖 AI Summary
To address low pose fidelity, view inconsistency, and the difficulty of occlusion-aware geometric reasoning in camera-controlled video generation, this paper proposes a depth-free generative framework. Methodologically, it introduces (1) an infinite homography warping mechanism, the first of its kind, that models 3D camera rotation directly in the 2D latent space, bypassing error-prone depth estimation; (2) a synthetic multi-view data augmentation pipeline that enables end-to-end training with variable focal lengths and diverse camera trajectories; and (3) a geometry-aware video diffusion architecture that combines latent-space conditioning with end-to-end residual disparity prediction. Experiments demonstrate significant improvements over state-of-the-art baselines in both pose accuracy and visual quality. Moreover, the method exhibits strong cross-domain generalization: models trained solely on synthetic data transfer effectively to real-world videos.
📝 Abstract
Recent progress in video diffusion models has spurred growing interest in camera-controlled novel-view video generation for dynamic scenes, aiming to provide creators with cinematic camera control in post-production. A key challenge in camera-controlled video generation is ensuring fidelity to the specified camera pose while maintaining view consistency and reasoning about occluded geometry from limited observations. To address this, existing methods either train a trajectory-conditioned video generation model on trajectory-video pair datasets, or estimate depth from the input video, reproject it along a target trajectory, and generate the unprojected regions. Nevertheless, existing methods struggle to generate camera-pose-faithful, high-quality videos for two main reasons: (1) reprojection-based approaches are highly susceptible to errors caused by inaccurate depth estimation; and (2) the limited diversity of camera trajectories in existing datasets restricts the generalization of learned models. To address these limitations, we present InfCam, a depth-free, camera-controlled video-to-video generation framework with high pose fidelity. The framework integrates two key components: (1) infinite homography warping, which encodes 3D camera rotations directly within the 2D latent space of a video diffusion model. Conditioned on this noise-free rotational information, the residual parallax term is predicted through end-to-end training to achieve high camera-pose fidelity; and (2) a data augmentation pipeline that transforms existing synthetic multi-view datasets into sequences with diverse trajectories and focal lengths. Experimental results demonstrate that InfCam outperforms baseline methods in camera-pose accuracy and visual fidelity, generalizing well from synthetic to real-world data. Project page: https://emjay73.github.io/InfCam/
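The depth-free rotation modeling above rests on a classical multi-view geometry fact: a pure camera rotation induces the depth-independent "infinite homography" x' ~ K' R K⁻¹ x between two views. The sketch below illustrates only this standard relation in numpy; the intrinsics, rotation, and function names are illustrative and not taken from the paper, which applies the warp to diffusion latents rather than raw pixels.

```python
import numpy as np

def infinite_homography(K_src, K_dst, R):
    """Homography induced by a pure rotation R between two cameras:
    x' ~ K_dst @ R @ inv(K_src) @ x. No depth appears anywhere,
    which is why rotational warping needs no depth estimation."""
    return K_dst @ R @ np.linalg.inv(K_src)

def warp_points(H, pts):
    """Apply a 3x3 homography to an (N, 2) array of pixel coords."""
    pts_h = np.hstack([pts, np.ones((len(pts), 1))])  # homogeneous coords
    out = (H @ pts_h.T).T
    return out[:, :2] / out[:, 2:3]                   # dehomogenize

# Illustrative intrinsics (focal length f, principal point (cx, cy)).
f, cx, cy = 500.0, 320.0, 240.0
K = np.array([[f, 0.0, cx],
              [0.0, f, cy],
              [0.0, 0.0, 1.0]])

# Rotate the camera 5 degrees about the vertical (y) axis.
theta = np.deg2rad(5.0)
R = np.array([[np.cos(theta), 0.0, np.sin(theta)],
              [0.0, 1.0, 0.0],
              [-np.sin(theta), 0.0, np.cos(theta)]])

H = infinite_homography(K, K, R)
# The principal point shifts horizontally by exactly f * tan(theta),
# with no vertical motion, regardless of scene depth.
shifted = warp_points(H, np.array([[cx, cy]]))
print(shifted)
```

Because the residual between this rotational warp and the true target view is pure parallax (a function of translation and depth), a generative model conditioned on the warped latents only needs to predict that residual term.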