🤖 AI Summary
Existing video diffusion models struggle to simultaneously achieve fine-grained control over scene composition, multi-view consistent subject customization, and motion dynamics such as camera or object movement. To address this limitation, this work proposes the Tri-Prompting framework, which employs a two-stage training strategy to jointly model scene, subject, and motion. The approach leverages 3D tracking points to drive background motion and downsampled RGB cues to guide foreground subjects, while introducing a ControlNet scale scheduler at inference time to balance controllability and visual fidelity. This method is the first to enable unified, controllable generation of all three elements, supporting 3D-aware subject insertion and motion editing of existing subjects. It significantly outperforms baselines such as Phantom and DaS in multi-view identity preservation, 3D consistency, and motion accuracy, overcoming the prior limitation that each dimension could be controlled only in isolation.
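The summary does not include code; for intuition, here is a minimal sketch of what an inference-time ControlNet scale schedule could look like. The linear decay, the `controlnet_scale` function name, and the default `start`/`end` values are assumptions for illustration, not the paper's actual schedule; the idea is simply to apply strong control during early, structure-setting denoising steps and relax it later to preserve visual fidelity.

```python
def controlnet_scale(step: int, num_steps: int,
                     start: float = 1.0, end: float = 0.3) -> float:
    """Linearly anneal the ControlNet conditioning scale over denoising steps.

    Early steps (scale near `start`) enforce the control signal strongly;
    late steps (scale near `end`) loosen it so the model can refine details.
    """
    t = step / max(num_steps - 1, 1)
    return start + (end - start) * t

# Hypothetical usage inside a denoising loop (interface is illustrative):
# for step in range(num_steps):
#     scale = controlnet_scale(step, num_steps)
#     residuals = controlnet(latents, condition, timestep)
#     latents = unet(latents, timestep,
#                    control_residuals=[r * scale for r in residuals])
```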
📝 Abstract
Recent video diffusion models have made remarkable strides in visual quality, yet precise, fine-grained control remains a key bottleneck that limits practical customizability for content creation. For AI video creators, three forms of control are crucial: (i) scene composition, (ii) multi-view consistent subject customization, and (iii) camera-pose or object-motion adjustment. Existing methods typically handle these dimensions in isolation, with limited support for multi-view subject synthesis and identity preservation under arbitrary pose changes. This lack of a unified architecture makes it difficult to support versatile, jointly controllable video generation. We introduce Tri-Prompting, a unified framework and two-stage training paradigm that integrates scene composition, multi-view subject consistency, and motion control. Our approach leverages a dual-condition motion module driven by 3D tracking points for background scenes and downsampled RGB cues for foreground subjects. To balance controllability against visual realism, we further propose an inference-time ControlNet scale schedule. Tri-Prompting supports novel workflows, including 3D-aware subject insertion into arbitrary scenes and manipulation of existing subjects in an image. Experimental results demonstrate that Tri-Prompting significantly outperforms specialized baselines such as Phantom and DaS in multi-view subject identity preservation, 3D consistency, and motion accuracy.
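The abstract names a dual-condition motion module but gives no implementation details. The following PyTorch sketch shows one plausible way the two control streams could be fused into a single conditioning feature map; the module name, the rasterized per-pixel 3D-track input representation, and all layer choices are hypothetical, not the paper's architecture.

```python
import torch
import torch.nn as nn

class DualConditionMotionModule(nn.Module):
    """Illustrative sketch: fuse two control streams into one conditioning map.

    Stream 1: 3D tracking points for the background scene, assumed here to be
    rasterized into a per-pixel XYZ map. Stream 2: a downsampled RGB cue for
    the foreground subject, assumed resized to the same spatial resolution.
    """

    def __init__(self, point_dim: int = 3, rgb_channels: int = 3, hidden: int = 64):
        super().__init__()
        # Encode the rasterized 3D tracking-point map (background motion).
        self.point_encoder = nn.Conv2d(point_dim, hidden, kernel_size=3, padding=1)
        # Encode the downsampled RGB reference (foreground subject).
        self.rgb_encoder = nn.Conv2d(rgb_channels, hidden, kernel_size=3, padding=1)
        # Merge both streams into a single conditioning feature map.
        self.fuse = nn.Conv2d(2 * hidden, hidden, kernel_size=1)

    def forward(self, point_map: torch.Tensor, rgb_cue: torch.Tensor) -> torch.Tensor:
        # point_map: (B, 3, H, W) rasterized 3D tracks; rgb_cue: (B, 3, H, W).
        feats = torch.cat([self.point_encoder(point_map),
                           self.rgb_encoder(rgb_cue)], dim=1)
        # Output would be injected into a ControlNet-style conditioning branch.
        return self.fuse(feats)
```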