VidCRAFT3: Camera, Object, and Lighting Control for Image-to-Video Generation

📅 2025-02-11
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing image-to-video generation methods support only coarse-grained control over one or two visual elements (e.g., camera or object motion) and cannot jointly and precisely regulate multiple critical factors such as camera motion, object motion, and lighting direction. To address this limitation, the paper proposes a framework enabling fine-grained, coordinated control over all three elements. The approach introduces the Spatial Triple-Attention Transformer architecture, constructs VideoLightingDirection (VLD), a synthetic video dataset with explicit lighting-direction annotations, and proposes a three-stage decoupled training strategy that removes the need for data jointly annotated with all attributes. By integrating multimodal conditioning (text, image, and lighting direction), the method supports physically plausible modeling of light transmission and reflection. Extensive experiments demonstrate significant improvements over state-of-the-art methods across multiple benchmarks in control accuracy, temporal consistency, visual fidelity, and physical plausibility.

📝 Abstract
Recent image-to-video generation methods have demonstrated success in enabling control over one or two visual elements, such as camera trajectory or object motion. However, these methods are unable to offer control over multiple visual elements due to limitations in data and network efficacy. In this paper, we introduce VidCRAFT3, a novel framework for precise image-to-video generation that enables control over camera motion, object motion, and lighting direction simultaneously. To better decouple control over each visual element, we propose the Spatial Triple-Attention Transformer, which integrates lighting direction, text, and image in a symmetric way. Since most real-world video datasets lack lighting annotations, we construct a high-quality synthetic video dataset, the VideoLightingDirection (VLD) dataset. This dataset includes lighting direction annotations and objects of diverse appearance, enabling VidCRAFT3 to effectively handle strong light transmission and reflection effects. Additionally, we propose a three-stage training strategy that eliminates the need for training data annotated with multiple visual elements (camera motion, object motion, and lighting direction) simultaneously. Extensive experiments on benchmark datasets demonstrate the efficacy of VidCRAFT3 in producing high-quality video content, surpassing existing state-of-the-art methods in terms of control granularity and visual coherence. All code and data will be publicly available. Project page: https://sixiaozheng.github.io/VidCRAFT3/.
Problem

Research questions and friction points this paper is trying to address.

Jointly controlling multiple visual elements in image-to-video generation.
Achieving precise control over camera motion, object motion, and lighting direction.
Generating high-quality videos without data annotated with all visual elements simultaneously.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Simultaneous control over multiple visual elements
Spatial Triple-Attention Transformer integration
High-quality synthetic video dataset creation
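The abstract describes the Spatial Triple-Attention Transformer as integrating lighting direction, text, and image conditions "in a symmetric way." The paper's exact layer design is not given here, so the following is only an illustrative NumPy sketch of one plausible reading: video tokens run three parallel cross-attention streams, one per conditioning modality, and the streams are combined by summation with a residual connection. The function names, the summation-based fusion, and the single-head formulation are all assumptions, not the authors' implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(q, kv):
    # Single-head scaled dot-product attention: queries attend to one
    # conditioning modality (keys and values share the same tokens here).
    d = q.shape[-1]
    scores = q @ kv.T / np.sqrt(d)
    return softmax(scores) @ kv

def triple_attention(video_tokens, image_tokens, text_tokens, light_tokens):
    # Hypothetical "symmetric" fusion: each modality is attended to by the
    # same query stream, and the three outputs are summed, so no modality
    # is structurally privileged over the others.
    out = np.zeros_like(video_tokens)
    for cond in (image_tokens, text_tokens, light_tokens):
        out = out + cross_attention(video_tokens, cond)
    return video_tokens + out  # residual connection

# Example with toy token counts: 4 video tokens, embedding dim 8.
rng = np.random.default_rng(0)
video = rng.standard_normal((4, 8))
image = rng.standard_normal((3, 8))
text = rng.standard_normal((5, 8))
light = rng.standard_normal((2, 8))
fused = triple_attention(video, image, text, light)
```

Treating the three conditions symmetrically (rather than chaining them) keeps each modality's influence independent, which matches the paper's stated goal of decoupling control over the individual visual elements.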
Sixiao Zheng
Fudan University, China
Zimian Peng
Zhejiang University, China
Yanpeng Zhou
Huawei Noah’s Ark Lab
Yi Zhu
Huawei Noah’s Ark Lab, China
Hang Xu
Huawei Noah’s Ark Lab, China
Xiangru Huang
Westlake University
Machine Learning and Optimization · Geometry Processing · Deep Learning
Yanwei Fu
Fudan University
Computer Vision · Machine Learning · Multimedia