CineMaster: A 3D-Aware and Controllable Framework for Cinematic Text-to-Video Generation

📅 2025-02-12
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the weak 3D awareness and limited spatial/camera controllability of text-to-video generation, proposing a 3D-controllable video generation framework tailored for cinematic production. Methodologically: (1) it introduces a two-stage interactive 3D conditioning pipeline enabling precise object placement, parametric camera-trajectory specification, and compositional control in 3D space; (2) it establishes an automated 3D annotation pipeline that extracts 3D bounding boxes and camera trajectories from large-scale video data to support multi-attribute controllability; (3) it conditions a text-to-video diffusion model on rendered depth maps, camera trajectories, and object class labels to achieve 3D-aware generation. Experiments demonstrate state-of-the-art performance in 3D consistency, motion controllability, and text-video alignment, advancing professional-grade cinematic scene-layout generation.

📝 Abstract
In this work, we present CineMaster, a novel framework for 3D-aware and controllable text-to-video generation. Our goal is to empower users with controllability comparable to that of professional film directors: precise placement of objects within the scene, flexible manipulation of both objects and camera in 3D space, and intuitive layout control over the rendered frames. To achieve this, CineMaster operates in two stages. In the first stage, we design an interactive workflow that allows users to intuitively construct 3D-aware conditional signals by positioning object bounding boxes and defining camera movements within the 3D space. In the second stage, these control signals--comprising rendered depth maps, camera trajectories and object class labels--serve as the guidance for a text-to-video diffusion model, steering it to generate the user-intended video content. Furthermore, to overcome the scarcity of in-the-wild datasets with 3D object motion and camera pose annotations, we carefully establish an automated data annotation pipeline that extracts 3D bounding boxes and camera trajectories from large-scale video data. Extensive qualitative and quantitative experiments demonstrate that CineMaster significantly outperforms existing methods and achieves prominent 3D-aware text-to-video generation. Project page: https://cinemaster-dev.github.io/.
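The abstract's second stage conditions the diffusion model on depth maps rendered from user-placed 3D bounding boxes and a camera pose. As a rough illustration of that kind of conditioning signal (a hypothetical sketch, not the paper's actual renderer: the function name, the pinhole projection, and the crude axis-aligned fill are all assumptions), a box and camera could be turned into a coarse depth map like so:

```python
import numpy as np

def project_box_depth(corners_world, K, w2c, hw):
    """Render a coarse depth map from one 3D bounding box (hypothetical sketch).

    corners_world: (8, 3) box corners in world coordinates.
    K:             (3, 3) pinhole camera intrinsics.
    w2c:           (4, 4) world-to-camera extrinsics (the camera pose).
    hw:            (height, width) of the output depth map.
    """
    h, w = hw
    # Move the corners into camera space with the homogeneous transform.
    pts = np.hstack([corners_world, np.ones((8, 1))]) @ w2c.T
    cam = pts[:, :3]
    z = cam[:, 2]
    # Project into pixel coordinates with the pinhole model.
    uv = cam @ K.T
    uv = uv[:, :2] / uv[:, 2:3]
    depth = np.zeros((h, w), dtype=np.float32)
    # Fill the 2D bounding rectangle of the projected corners with the
    # nearest corner depth -- a crude stand-in for proper rasterization.
    u0, v0 = np.floor(uv.min(axis=0)).astype(int)
    u1, v1 = np.ceil(uv.max(axis=0)).astype(int)
    u0, v0 = max(u0, 0), max(v0, 0)
    u1, v1 = min(u1, w), min(v1, h)
    if u1 > u0 and v1 > v0 and z.min() > 0:
        depth[v0:v1, u0:u1] = z.min()
    return depth
```

In a full pipeline one depth map per frame would be rendered as the boxes and camera move along their trajectories, then stacked as the conditioning input alongside the class labels and text prompt.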
Problem

Research questions and friction points this paper is trying to address.

3D-aware text-to-video generation
User-controlled object and camera manipulation
Automated 3D data annotation from videos
Innovation

Methods, ideas, or system contributions that make the work stand out.

3D-aware text-to-video generation
Interactive 3D object and camera control
Automated 3D data annotation pipeline