🤖 AI Summary
Redirecting the camera trajectory of a monocular video lacks precision and controllability in existing approaches. Method: We propose the first dual-stream conditional video diffusion model that decouples deterministic viewpoint transformation from stochastic content generation, enabling arbitrary user-specified 4D camera paths. To train it, we design a double-reprojection strategy that constructs hybrid training data, integrating web-scale monocular videos with static multi-view datasets, and we introduce point-cloud rendering guidance jointly conditioned on the source video to ensure spatiotemporal consistency in novel-view synthesis. Contribution/Results: Our method generalizes to diverse scenes without requiring multi-view input. Experiments demonstrate significant improvements over state-of-the-art approaches on both multi-view benchmarks and large-scale monocular video datasets, achieving high-fidelity 4D content generation with precise, user-controllable camera trajectories.
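To make the dual-stream conditioning concrete, below is a minimal PyTorch sketch of how a denoiser might ingest the two condition streams: point-cloud-render latents fix the target viewpoint, while source-video latents supply appearance through cross-attention. All names, dimensions, and wiring (`DualStreamDenoiser`, `render_proj`, `source_attn`) are illustrative assumptions, not the paper's actual architecture.

```python
import torch
import torch.nn as nn

class DualStreamDenoiser(nn.Module):
    """Minimal sketch of dual-stream conditioning (illustrative only).

    Stream 1 (deterministic view transform): latents of point-cloud
    renders along the target trajectory, injected additively so they
    pin down *where* content appears in the novel view.
    Stream 2 (stochastic content): latents of the source video,
    attended to via cross-attention so they supply *what* the content
    looks like (appearance and dynamics).
    """

    def __init__(self, dim: int = 320, heads: int = 8):
        super().__init__()
        self.render_proj = nn.Conv3d(dim, dim, kernel_size=1)      # viewpoint stream
        self.backbone = nn.Conv3d(dim, dim, kernel_size=3, padding=1)
        self.source_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, noisy, render_latents, source_latents, t_emb):
        # Fuse the point-cloud-render stream into the noisy latents.
        h = noisy + self.render_proj(render_latents)
        h = self.backbone(h) + t_emb.view(1, -1, 1, 1, 1)

        # Cross-attend from the fused latents to the source-video stream.
        b, c, f, hh, ww = h.shape
        q = h.permute(0, 2, 3, 4, 1).reshape(b, -1, c)
        kv = source_latents.permute(0, 2, 3, 4, 1).reshape(b, -1, c)
        out, _ = self.source_attn(q, kv, kv)
        return h + out.reshape(b, f, hh, ww, c).permute(0, 4, 1, 2, 3)

# Example shapes: batch 1, 8 frames, 16x16 latent grid.
model = DualStreamDenoiser()
x = torch.randn(1, 320, 8, 16, 16)
pred = model(x, torch.randn_like(x), torch.randn_like(x), torch.randn(320))
```

Keeping the two streams separate mirrors the decoupling the summary describes: the additive render stream carries the deterministic geometry, while cross-attention lets the model borrow appearance from the source video without copying its viewpoint.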
📝 Abstract
We present TrajectoryCrafter, a novel approach for redirecting camera trajectories of monocular videos. By disentangling deterministic view transformations from stochastic content generation, our method achieves precise control over user-specified camera trajectories. We propose a dual-stream conditional video diffusion model that concurrently integrates point cloud renders and source videos as conditions, ensuring accurate view transformations and coherent 4D content generation. Instead of relying on scarce multi-view videos, we curate a hybrid training dataset that combines web-scale monocular videos with static multi-view datasets through our innovative double-reprojection strategy, significantly fostering robust generalization across diverse scenes. Extensive evaluations on multi-view benchmarks and large-scale monocular videos demonstrate the superior performance of our method.
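As a rough illustration of the double-reprojection idea, the sketch below builds a training pair from a single monocular frame: the frame's point cloud is first splatted into a perturbed pose, and only the points visible from that pose are splatted back to the source view, yielding a hole-ridden render that is pixel-aligned with the original frame. The helper names and the naive z-buffer splatting are assumptions for illustration, not the paper's exact pipeline; per-frame depth and intrinsics `K` are assumed available.

```python
import torch

def unproject(depth, K):
    """Lift a depth map (H, W) into camera-space 3D points (H*W, 3)."""
    H, W = depth.shape
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    pix = torch.stack([xs, ys, torch.ones_like(xs)], dim=-1).float()
    rays = pix.reshape(-1, 3) @ torch.linalg.inv(K).T
    return rays * depth.reshape(-1, 1)

def splat(points, colors, K, pose, H, W):
    """Forward-splat colored points through a 4x4 pose with a naive
    z-buffer; returns the rendered image and surviving point indices."""
    cam = points @ pose[:3, :3].T + pose[:3, 3]
    z = cam[:, 2]
    proj = cam @ K.T
    u = (proj[:, 0] / z.clamp(min=1e-6)).round().long()
    v = (proj[:, 1] / z.clamp(min=1e-6)).round().long()
    valid = (z > 1e-6) & (u >= 0) & (u < W) & (v >= 0) & (v < H)
    image = torch.zeros(H, W, 3)
    zbuf = torch.full((H, W), float("inf"))
    winner = torch.full((H, W), -1, dtype=torch.long)
    for i in torch.nonzero(valid).flatten().tolist():  # slow but clear
        if z[i] < zbuf[v[i], u[i]]:
            zbuf[v[i], u[i]] = z[i]
            winner[v[i], u[i]] = i
            image[v[i], u[i]] = colors[i]
    return image, winner[winner >= 0]

def double_reprojection_pair(frame, depth, K, delta_pose):
    """Build a (condition, ground-truth) pair from one monocular frame.

    Step 1: splat the frame's point cloud into a perturbed pose, so
    points occluded from that viewpoint drop out.
    Step 2: splat the survivors back to the source pose, producing a
    hole-ridden render pixel-aligned with the original frame, which
    then serves as ground truth.
    """
    H, W, _ = frame.shape
    pts, colors = unproject(depth, K), frame.reshape(-1, 3)
    _, kept = splat(pts, colors, K, delta_pose, H, W)                # step 1
    cond, _ = splat(pts[kept], colors[kept], K, torch.eye(4), H, W)  # step 2
    return cond, frame
```

Because the condition and target are pixel-aligned by construction, ordinary monocular frames can supervise novel-view synthesis without any multi-view capture; the magnitude of the pose perturbation controls how severe the occlusion holes in the condition become.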