OmniCam: Unified Multimodal Video Generation via Camera Control

📅 2025-04-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing camera control methods suffer from cumbersome interaction, coarse-grained control, and weak multimodal coordination, resulting in videos with poor spatiotemporal consistency and inaccurate camera trajectories. To address these limitations, this paper introduces the first unified multimodal camera control framework, enabling flexible, compositional conditioning via text/video-guided trajectories jointly with image/video content references. We construct OmniTr—the first high-quality, long-sequence dataset of temporally aligned trajectory-video-caption triplets—and propose a novel architecture that synergistically integrates large language models with video diffusion models, incorporating an explicit spatiotemporal consistency modeling module. The entire framework is trained end-to-end on OmniTr. Extensive experiments demonstrate state-of-the-art performance across multiple quantitative and qualitative metrics, significantly improving spatiotemporal coherence, trajectory tracking fidelity, and visual expressiveness of generated videos.
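
The summary describes OmniTr as temporally aligned trajectory-video-caption triplets. As a concrete illustration only, one such record might be laid out as below; the field names are assumptions, since the paper's actual schema is not given here.

```python
# Hypothetical layout of one OmniTr sample. The real dataset schema is not
# specified in this summary, so every field name here is an assumption.
from dataclasses import dataclass


@dataclass
class CameraPose:
    frame: int                             # index of the aligned video frame
    position: tuple[float, float, float]   # camera center in world coordinates
    rotation: tuple[float, float, float]   # camera orientation, e.g. Euler angles


@dataclass
class OmniTrSample:
    video_path: str               # long-sequence source video
    caption: str                  # text description of the camera motion
    trajectory: list[CameraPose]  # per-frame poses, temporally aligned with the video
```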

📝 Abstract
Camera control, which achieves diverse visual effects by changing camera position and pose, has attracted widespread attention. However, existing methods face challenges such as cumbersome interaction and limited control capabilities. To address these issues, we present OmniCam, a unified multimodal camera control framework. Leveraging large language models and video diffusion models, OmniCam generates spatio-temporally consistent videos. It supports various combinations of input modalities: the user can provide text or a video with the expected trajectory as camera-path guidance, and an image or a video as the content reference, enabling precise control over camera motion. To facilitate the training of OmniCam, we introduce the OmniTr dataset, which contains a large collection of high-quality long-sequence trajectories, videos, and corresponding descriptions. Experimental results demonstrate that our model achieves state-of-the-art performance on high-quality camera-controlled video generation across various metrics.
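
To make the supported modality combinations concrete, here is a minimal sketch of a conditioning interface in the spirit of the abstract. Every name (CameraGuidance, ContentReference, generate_video) is hypothetical and not from the authors' code; the sketch only encodes the rule that camera guidance comes from text or a trajectory video, and content comes from an image or a video.

```python
# Hypothetical sketch of OmniCam's input combinations, as described in the
# abstract. None of these names come from the paper's code release.
from dataclasses import dataclass
from typing import Optional


@dataclass
class CameraGuidance:
    """Camera-path guidance: exactly one of text or a reference trajectory video."""
    text: Optional[str] = None               # e.g. "pan left, then slowly zoom in"
    trajectory_video: Optional[str] = None   # video whose camera motion to imitate


@dataclass
class ContentReference:
    """Content reference: exactly one of a still image or a video."""
    image: Optional[str] = None
    video: Optional[str] = None


def generate_video(guidance: CameraGuidance, content: ContentReference) -> str:
    """Placeholder for the LLM + video-diffusion pipeline the abstract describes."""
    if (guidance.text is None) == (guidance.trajectory_video is None):
        raise ValueError("Provide exactly one of text or trajectory_video.")
    if (content.image is None) == (content.video is None):
        raise ValueError("Provide exactly one of image or video.")
    # 1) an LLM would parse the guidance into an explicit camera trajectory
    # 2) a video diffusion model would render the content along that trajectory
    return "generated.mp4"


# Example: text-guided trajectory with a single image as the content reference.
clip = generate_video(
    CameraGuidance(text="orbit right around the subject"),
    ContentReference(image="scene.png"),
)
```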
Problem

Research questions and friction points this paper is trying to address.

Unified multimodal camera control for video generation
Overcoming complex interaction and limited control in camera motion
Generating spatio-temporally consistent videos with diverse inputs
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unified multimodal camera control framework
Leverages LLMs and video diffusion models
Supports text or video trajectory guidance
👥 Authors
Xiaoda Yang (Zhejiang University)
Jiayang Xu (University of Michigan, Aerospace Engineering; Reduced Order Modeling in CFD)
Kaixuan Luan (Zhejiang University)
Xinyu Zhan (Shanghai Jiao Tong University)
Hongshun Qiu (Beijing University of Technology)
Shijun Shi (Jiangnan University)
Hao Li (University of Science and Technology of China)
Shuai Yang (Zhejiang University)
Li Zhang (University of Science and Technology of China)
Checheng Yu (Nanjing University; Robotics, RL)
Cewu Lu (Shanghai Jiao Tong University)
Lixin Yang (Shanghai Jiao Tong University)