Generative Photographic Control for Scene-Consistent Video Cinematic Editing

📅 2025-11-16
📈 Citations: 0 · Influential: 0
🤖 AI Summary
Existing generative video models lack fine-grained, disentangled control over professional cinematographic parameters such as depth of field and shutter speed, which limits both cinematic storytelling and spatiotemporal scene coherence. To address this, we propose CineCtrl, the first video editing framework to offer cinematography-grade parameter control. Our method introduces a decoupled cross-attention mechanism that explicitly separates camera-motion modeling from photographic-effect modeling, ensuring parameter controllability and inter-frame temporal consistency. Furthermore, we construct a large-scale video dataset with precise cinematographic parameter annotations drawn from physics-informed simulation and real-world capture. Experiments demonstrate that CineCtrl faithfully honors user-specified, multi-dimensional cinematographic parameters, generating high-fidelity, temporally coherent cinematic videos. It significantly outperforms prior methods on depth-of-field and exposure control, establishing a new state of the art in parameter-aware video editing.

📝 Abstract
Cinematic storytelling is profoundly shaped by the artful manipulation of photographic elements such as depth of field and exposure. These effects are crucial in conveying mood and creating aesthetic appeal. However, controlling these effects in generative video models remains highly challenging, as most existing methods are restricted to camera motion control. In this paper, we propose CineCtrl, the first video cinematic editing framework that provides fine control over professional camera parameters (e.g., bokeh, shutter speed). We introduce a decoupled cross-attention mechanism to disentangle camera motion from photographic inputs, allowing fine-grained, independent control without compromising scene consistency. To overcome the shortage of training data, we develop a comprehensive data generation strategy that leverages simulated photographic effects with a dedicated real-world collection pipeline, enabling the construction of a large-scale dataset for robust model training. Extensive experiments demonstrate that our model generates high-fidelity videos with precisely controlled, user-specified photographic camera effects.
Problem

Research questions and friction points this paper is trying to address.

Controlling photographic effects such as depth of field in generative video models
Disentangling camera motion from photographic inputs to enable independent control
Mitigating training-data scarcity by pairing simulated photographic effects with real-world collection (see the data-synthesis sketch after this list)
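
The page does not describe the data pipeline beyond the bullet above, but the idea of synthesizing parameter-annotated training pairs can be sketched. The snippet below is a minimal, hypothetical illustration: simulate_depth_of_field, simulate_exposure, the aperture/ev ranges, and the depth-map input are all assumptions, not details from the paper.

```python
import cv2
import numpy as np

def simulate_depth_of_field(frame, depth, focus_depth, aperture):
    """Blur pixels in proportion to their distance from the focal plane.

    Cheap approximation: blend one max-sigma Gaussian blur back into the
    sharp frame, weighted by the normalized circle of confusion, rather
    than applying a true per-pixel variable kernel.
    """
    coc = np.abs(depth - focus_depth) * aperture        # (H, W) blur radii
    max_sigma = float(coc.max()) + 1e-6
    blurred = cv2.GaussianBlur(frame, (0, 0), sigmaX=max_sigma)
    alpha = (coc / (coc.max() + 1e-6))[..., None]       # (H, W, 1) in [0, 1]
    return (1.0 - alpha) * frame + alpha * blurred

def simulate_exposure(frame, ev):
    """Shift exposure by `ev` stops with a simple linear gain."""
    return np.clip(frame * (2.0 ** ev), 0.0, 1.0)

def make_training_pair(frame, depth, rng):
    """Sample random parameters and render the corresponding edited frame.

    `frame` is a float image in [0, 1]; `depth` is a per-pixel depth map,
    e.g. from an off-the-shelf monocular depth estimator.
    """
    params = {
        "focus_depth": rng.uniform(depth.min(), depth.max()),
        "aperture": rng.uniform(0.0, 4.0),   # illustrative range
        "ev": rng.uniform(-2.0, 2.0),        # illustrative range
    }
    edited = simulate_depth_of_field(
        frame, depth, params["focus_depth"], params["aperture"])
    edited = simulate_exposure(edited, params["ev"])
    return params, edited  # conditioning parameters, target frame
```

Shutter-speed (motion-blur) effects could be approximated analogously, e.g. by averaging neighboring frames; per the abstract, the actual dataset also includes real-world captures from a dedicated collection pipeline.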
Innovation

Methods, ideas, or system contributions that make the work stand out.

Decoupled cross-attention mechanism that disentangles camera motion from photographic conditioning (one possible form is sketched after this list)
Comprehensive data generation with simulated effects
Fine-grained control over professional camera parameters
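
Neither the summary nor the abstract specifies the block design. The PyTorch sketch below shows one plausible form of a decoupled cross-attention layer: two parallel branches attend to the motion and photographic conditions separately, then sum into the residual stream. All names (DecoupledCrossAttention, motion_tokens, photo_tokens) and dimensions are illustrative assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class DecoupledCrossAttention(nn.Module):
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        # One cross-attention branch per condition, so motion and
        # photographic signals never mix inside attention.
        self.motion_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.photo_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x, motion_tokens, photo_tokens):
        # x: (B, N, D) video latents; *_tokens: (B, M, D) condition embeddings.
        h = self.norm(x)
        m_out, _ = self.motion_attn(h, motion_tokens, motion_tokens)
        p_out, _ = self.photo_attn(h, photo_tokens, photo_tokens)
        # Residual sum keeps the two conditioning paths additive.
        return x + m_out + p_out

# Usage: condition 256 latent tokens on 4 motion and 2 photographic tokens.
block = DecoupledCrossAttention(dim=320)
x = torch.randn(1, 256, 320)
motion = torch.randn(1, 4, 320)
photo = torch.randn(1, 2, 320)   # e.g. embedded aperture + shutter speed
out = block(x, motion, photo)    # (1, 256, 320)
```

In this sketch, keeping the branches additive means either condition can be dropped or rescaled at inference without disturbing the other, which is one way to realize the independent control the paper describes.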
👥 Authors

Huiqiang Sun · School of AIA, Huazhong University of Science and Technology
Liao Shen · Huazhong University of Science and Technology · Computer Vision
Zhan Peng · School of AIA, Huazhong University of Science and Technology
Kun Wang · SenseTime Research
Size Wu · Nanyang Technological University · Computer Vision
Yuhang Zang · Shanghai AI Laboratory · Natural Language Processing, Vision Language Model
Tianqi Liu · School of AIA, Huazhong University of Science and Technology
Zihao Huang · S-Lab, Nanyang Technological University
Xingyu Zeng · Shenzhen University of Advanced Technology · Computer Vision, Deep Learning
Zhiguo Cao · Huazhong University of Science and Technology · Pattern Recognition, Computer Vision
Wei Li · S-Lab, Nanyang Technological University
Chen Change Loy · President's Chair Professor, MMLab@NTU, S-Lab, Nanyang Technological University · Computer Vision, Image Processing, Machine Learning