CT-1: Vision-Language-Camera Models Transfer Spatial Reasoning Knowledge to Camera-Controllable Video Generation

📅 2026-04-10

🤖 AI Summary

Existing camera-controllable video generation methods either rely on text prompts, which yield imprecise camera control, or require manually specified camera trajectory parameters, which hinders automation. This work proposes CT-1 (Camera Transformer 1), a Vision-Language-Camera model that combines vision-language modules with a Diffusion Transformer to estimate camera trajectories, which then guide a video diffusion model. The trajectory model is trained with a wavelet-based regularization loss in the frequency domain on CT-200K, a dataset of over 47 million frames built with a dedicated curation pipeline. The authors report a 25.7% improvement in camera control accuracy over prior methods, with generated videos that follow physically plausible camera motion.

📝 Abstract

Camera-controllable video generation aims to synthesize videos with flexible and physically plausible camera movements. However, existing methods either provide imprecise camera control from text prompts or rely on labor-intensive manual camera trajectory parameters, limiting their use in automated scenarios. To address these issues, we propose a novel Vision-Language-Camera model, termed CT-1 (Camera Transformer 1), a specialized model designed to transfer spatial reasoning knowledge to video generation by accurately estimating camera trajectories. Built upon vision-language modules and a Diffusion Transformer model, CT-1 employs a Wavelet-based Regularization Loss in the frequency domain to effectively learn complex camera trajectory distributions. These trajectories are integrated into a video diffusion model to enable spatially aware camera control that aligns with user intentions. To facilitate the training of CT-1, we design a dedicated data curation pipeline and construct CT-200K, a large-scale dataset containing over 47M frames. Experimental results demonstrate that our framework successfully bridges the gap between spatial reasoning and video synthesis, yielding faithful and high-quality camera-controllable videos and improving camera control accuracy by 25.7% over prior methods.
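The abstract names a Wavelet-based Regularization Loss in the frequency domain but does not give its form. As a rough illustration only, the sketch below shows one way such a loss could be built: decompose predicted and reference camera trajectories with a multi-level Haar wavelet transform along the time axis, then penalize the squared difference of the coefficients per band. The function names, the choice of the Haar wavelet, the `(T, D)` trajectory layout, and the `detail_weight` parameter are all assumptions made for this example, not details taken from the paper.

```python
import numpy as np


def haar_dwt_1d(x):
    """One level of the orthonormal Haar transform along the time axis.

    x has shape (T, D) with T even: T time steps, D pose parameters
    (hypothetical layout, e.g. D = 6 for translation plus rotation).
    Returns (approx, detail), each of shape (T // 2, D).
    """
    even, odd = x[0::2], x[1::2]
    approx = (even + odd) / np.sqrt(2.0)  # low-frequency band
    detail = (even - odd) / np.sqrt(2.0)  # high-frequency band
    return approx, detail


def wavelet_regularization_loss(pred, target, levels=2, detail_weight=1.0):
    """Mean squared error between Haar coefficients of two trajectories.

    Illustrative stand-in for a wavelet-domain loss; not the paper's
    exact formulation. Requires T divisible by 2 ** levels.
    """
    if pred.shape != target.shape:
        raise ValueError("pred and target must have the same shape")
    if pred.shape[0] % (2 ** levels) != 0:
        raise ValueError("trajectory length must be divisible by 2 ** levels")

    loss = 0.0
    a_pred, a_tgt = pred, target
    for _ in range(levels):
        a_pred, d_pred = haar_dwt_1d(a_pred)
        a_tgt, d_tgt = haar_dwt_1d(a_tgt)
        # Penalize mismatch in the high-frequency (detail) band at this scale.
        loss += detail_weight * np.mean((d_pred - d_tgt) ** 2)
    # Penalize mismatch in the remaining coarse (approximation) band.
    loss += np.mean((a_pred - a_tgt) ** 2)
    return float(loss)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    target = rng.normal(size=(8, 6))  # 8 frames, 6 pose parameters
    pred = target + 0.05 * rng.normal(size=(8, 6))
    print(wavelet_regularization_loss(pred, target))
```

Weighting the detail bands separately from the coarse band lets a training objective emphasize either smooth global motion or fine frame-to-frame changes. Whether CT-1 weights its bands this way is not stated in the abstract.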

Problem

Research questions and friction points this paper is trying to address.

camera-controllable video generation

spatial reasoning

camera trajectory

video synthesis

vision-language models

Innovation

Methods, ideas, or system contributions that make the work stand out.

Vision-Language-Camera Model

Camera-Controllable Video Generation

Wavelet-based Regularization Loss