DreaMontage: Arbitrary Frame-Guided One-Shot Video Generation

📅 2025-12-24
🤖 AI Summary
Existing single-shot video generation methods rely on naive frame concatenation, resulting in visual discontinuities, physically implausible motion, and limited artistic expressiveness. To address these limitations, we propose a keyframe-guided framework for long-duration, one-take video generation. Our method introduces a lightweight intermediate-conditioning injection mechanism with adaptive fine-tuning for fine-grained spatiotemporal control; integrates vision-language supervised fine-tuning (SFT) and a customized Direct Preference Optimization (DPO) scheme to enhance semantic consistency and aesthetic quality; and devises a Segment-wise Auto-Regressive (SAR) inference paradigm atop the DiT architecture for efficient, controllable long-video synthesis. Experiments demonstrate that our approach generates high-fidelity, cinematic-style videos spanning tens of seconds, achieving seamless transitions, physically plausible subject dynamics, and superior artistic expressiveness over baselines, while maintaining memory efficiency and user controllability.
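The intermediate-conditioning idea described above can be sketched as follows. This is a minimal illustration under assumed shapes and names (the paper does not publish code; `build_conditioned_latents`, the latent dimensions, and the channel-concatenation layout are all hypothetical): keyframes supplied at arbitrary timestamps are injected into the video latent sequence together with a binary indicator mask, so the DiT can see which positions are constrained.

```python
import numpy as np

def build_conditioned_latents(num_frames, latent_dim, keyframes, rng=None):
    """Assemble a latent sequence with arbitrary-frame conditioning.

    keyframes: dict mapping frame index -> latent vector of shape (latent_dim,).
    Returns (conditioned, mask): noisy latents with keyframe latents injected,
    concatenated with a 0/1 mask marking which positions are conditioned.
    """
    rng = rng or np.random.default_rng(0)
    latents = rng.standard_normal((num_frames, latent_dim))  # pure noise to denoise
    mask = np.zeros((num_frames, 1))
    for idx, z in keyframes.items():
        latents[idx] = z   # inject the known keyframe latent at its timestamp
        mask[idx] = 1.0    # flag this position as fixed for the model
    # the DiT would consume [latents ; mask] concatenated on the channel axis
    return np.concatenate([latents, mask], axis=-1), mask

cond, mask = build_conditioned_latents(
    num_frames=16, latent_dim=4,
    keyframes={0: np.zeros(4), 7: np.ones(4), 15: np.full(4, 2.0)},
)
print(cond.shape)       # (16, 5): 4 latent channels + 1 mask channel
print(int(mask.sum()))  # 3 conditioned positions
```

The mask channel is what lets a single model handle first-frame, last-frame, and arbitrary intermediate-frame guidance uniformly: unconditioned positions simply carry a zero flag.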

📝 Abstract
The "one-shot" technique represents a distinct and sophisticated aesthetic in filmmaking. However, its practical realization is often hindered by prohibitive costs and complex real-world constraints. Although emerging video generation models offer a virtual alternative, existing approaches typically rely on naive clip concatenation, which frequently fails to maintain visual smoothness and temporal coherence. In this paper, we introduce DreaMontage, a comprehensive framework designed for arbitrary frame-guided generation, capable of synthesizing seamless, expressive, and long-duration one-shot videos from diverse user-provided inputs. To achieve this, we address the challenge through three primary dimensions. (i) We integrate a lightweight intermediate-conditioning mechanism into the DiT architecture. By employing an Adaptive Tuning strategy that effectively leverages base training data, we unlock robust arbitrary-frame control capabilities. (ii) To enhance visual fidelity and cinematic expressiveness, we curate a high-quality dataset and implement a Visual Expression SFT stage. In addressing critical issues such as subject motion rationality and transition smoothness, we apply a Tailored DPO scheme, which significantly improves the success rate and usability of the generated content. (iii) To facilitate the production of extended sequences, we design a Segment-wise Auto-Regressive (SAR) inference strategy that operates in a memory-efficient manner. Extensive experiments demonstrate that our approach achieves visually striking and seamlessly coherent one-shot effects while maintaining computational efficiency, empowering users to transform fragmented visual materials into vivid, cohesive one-shot cinematic experiences.
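The Segment-wise Auto-Regressive (SAR) inference strategy from point (iii) can be sketched as a loop that synthesizes a long sequence piece by piece, carrying only a short overlap between segments. Everything below is an assumed, simplified stand-in (the real sampler is a DiT diffusion model; `generate_segment`, the segment length, and the overlap size are illustrative choices, not values from the paper):

```python
import numpy as np

def generate_segment(prefix, seg_len, rng):
    """Stand-in for one diffusion sampling pass: produces `seg_len` frames
    that continue smoothly from the conditioning `prefix` frames."""
    start = prefix[-1] if len(prefix) else np.zeros(4)
    return start + np.cumsum(rng.standard_normal((seg_len, 4)) * 0.01, axis=0)

def sar_inference(total_frames, seg_len=8, overlap=2, seed=0):
    """Segment-wise Auto-Regressive inference: build a long video segment by
    segment, keeping only `overlap` tail frames as conditioning context."""
    rng = np.random.default_rng(seed)
    video, prefix = [], np.empty((0, 4))
    while len(video) < total_frames:
        seg = generate_segment(prefix, seg_len, rng)
        video.extend(seg)
        prefix = seg[-overlap:]  # context stays O(overlap), not O(len(video))
    return np.stack(video[:total_frames])

out = sar_inference(total_frames=30)
print(out.shape)  # (30, 4)
```

The memory efficiency claimed in the abstract comes from this structure: each sampling pass only ever sees one segment plus a fixed-size overlap, so peak memory is independent of the total video length.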
Problem

Research questions and friction points this paper is trying to address.

Generates seamless long one-shot videos from arbitrary frames
Enhances visual fidelity and motion coherence in video synthesis
Enables memory-efficient production of extended cinematic sequences
Innovation

Methods, ideas, or system contributions that make the work stand out.

Adaptive Tuning for arbitrary-frame control in DiT
Tailored DPO for motion and transition smoothness
Segment-wise Auto-Regressive inference for long videos
👥 Authors
Jiawei Liu (Intelligence Creation Team, ByteDance)
Junqiao Li (Intelligence Creation Team, ByteDance)
Jiangfan Deng (ByteDance Inc.)
Gen Li (Intelligence Creation Team, ByteDance)
Siyu Zhou (Intelligence Creation Team, ByteDance)
Zetao Fang (Intelligence Creation Team, ByteDance)
Shanshan Lao (Intelligence Creation Team, ByteDance)
Zengde Deng (ByteDance)
Jianing Zhu (Postdoctoral Fellow, University of Texas at Austin)
Tingting Ma (ByteDance Inc.)
Jiayi Li (Intelligence Creation Team, ByteDance)
Yunqiu Wang (Intelligence Creation Team, ByteDance)
Qian He (ByteDance)
Xinglong Wu (Algorithm Engineer, ByteDance)