Video-As-Prompt: Unified Semantic Control for Video Generation

📅 2025-10-23
📈 Citations: 0 · Influential: 0
🤖 AI Summary
Semantic control in video generation remains challenging: existing approaches either introduce pixel-level priors that cause artifacts or rely on task-specific fine-tuning and architectures, limiting generalization. This paper proposes Video-As-Prompt (VAP), a paradigm that directly uses a reference video as a semantic prompt, eliminating the need for explicit conditioning signals. VAP combines a frozen video Diffusion Transformer (DiT), a temporally biased position embedding (to prevent spurious spatiotemporal correspondences between the reference and the target), and a Mixture-of-Transformers (MoT) expert-guidance mechanism, enabling in-context zero-shot semantic alignment and unified multi-task control. To support training, the authors construct VAP-Data, a dataset of over 100K paired videos spanning 100 semantic conditions. As a single unified model, VAP sets a new state of the art among open-source methods, attains a 38.7% user preference rate on par with condition-specific commercial models, and marks a significant advance toward general-purpose controllable video generation.
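The temporally biased position embedding is the easiest part of this design to miss. One plausible reading, sketched below in PyTorch, is that reference-video tokens receive 3D rotary position indices shifted only along the temporal axis, so the model treats the reference as preceding context frames rather than a frame-aligned overlay. The function name, grid sizes, and exact index layout are illustrative assumptions, not the authors' released code.

```python
import torch

def rope_indices(n_frames: int, height: int, width: int, t_offset: int = 0) -> torch.Tensor:
    """Return one (t, h, w) integer coordinate triple per video token.

    Only the temporal coordinate is shifted by `t_offset`; spatial
    coordinates are left untouched.
    """
    t = torch.arange(n_frames) + t_offset
    h = torch.arange(height)
    w = torch.arange(width)
    grid = torch.stack(torch.meshgrid(t, h, w, indexing="ij"), dim=-1)  # (T, H, W, 3)
    return grid.reshape(-1, 3)

# The reference clip is placed strictly before the target clip in time, so
# attention sees it as context frames: nothing ties reference frame i to
# target frame i at the same temporal position, which is the kind of
# spurious spatiotemporal correspondence the paper aims to remove.
T, H, W = 16, 30, 45                          # latent grid size (illustrative)
ref_ids = rope_indices(T, H, W, t_offset=0)   # reference video: t in [0, T)
tgt_ids = rope_indices(T, H, W, t_offset=T)   # target video:    t in [T, 2T)
position_ids = torch.cat([ref_ids, tgt_ids])  # consumed by a 3D RoPE module
```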

📝 Abstract
Unified, generalizable semantic control in video generation remains a critical open challenge. Existing methods either introduce artifacts by enforcing inappropriate pixel-wise priors from structure-based controls, or rely on non-generalizable, condition-specific finetuning or task-specific architectures. We introduce Video-As-Prompt (VAP), a new paradigm that reframes this problem as in-context generation. VAP leverages a reference video as a direct semantic prompt, guiding a frozen Video Diffusion Transformer (DiT) via a plug-and-play Mixture-of-Transformers (MoT) expert. This architecture prevents catastrophic forgetting and is guided by a temporally biased position embedding that eliminates spurious mapping priors for robust context retrieval. To power this approach and catalyze future research, we built VAP-Data, the largest dataset for semantic-controlled video generation with over 100K paired videos across 100 semantic conditions. As a single unified model, VAP sets a new state-of-the-art for open-source methods, achieving a 38.7% user preference rate that rivals leading condition-specific commercial models. VAP's strong zero-shot generalization and support for various downstream applications mark a significant advance toward general-purpose, controllable video generation.
Problem

Research questions and friction points this paper is trying to address.

Achieving unified semantic control in video generation
Overcoming artifacts from pixel-wise priors in existing methods
Enabling generalizable video generation without task-specific architectures
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses a reference video as a direct semantic prompt
Employs a plug-and-play Mixture-of-Transformers (MoT) expert to guide a frozen video DiT (see the sketch after this list)
Leverages a temporally biased position embedding for robust context retrieval
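As a rough illustration of the MoT expert guidance, the sketch below shows one way a trainable expert branch can steer a frozen DiT block: reference tokens pass through expert-owned projections, target tokens through the frozen pretrained ones, and the two streams interact only inside a shared attention. The `MoTBlock` class, and the assumption that the frozen block exposes `qkv` and `proj` projections, are hypothetical; this is a sketch of the idea under those assumptions, not the released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoTBlock(nn.Module):
    """Trainable expert alongside a frozen DiT block (illustrative only)."""

    def __init__(self, base_block: nn.Module, dim: int, n_heads: int):
        super().__init__()
        self.base = base_block                 # frozen pretrained DiT block
        for p in self.base.parameters():
            p.requires_grad_(False)            # freezing prevents catastrophic forgetting
        # Trainable expert: parallel projections used only by reference tokens.
        self.expert_qkv = nn.Linear(dim, 3 * dim)
        self.expert_out = nn.Linear(dim, dim)
        self.n_heads = n_heads

    def forward(self, x_tgt: torch.Tensor, x_ref: torch.Tensor):
        # Frozen projections for target tokens, trainable ones for reference
        # tokens (assumes the base block exposes .qkv and .proj linears).
        q_t, k_t, v_t = self.base.qkv(x_tgt).chunk(3, dim=-1)
        q_r, k_r, v_r = self.expert_qkv(x_ref).chunk(3, dim=-1)
        # Shared attention: every token attends over the union of both streams.
        q = torch.cat([q_t, q_r], dim=1)
        k = torch.cat([k_t, k_r], dim=1)
        v = torch.cat([v_t, v_r], dim=1)
        B, L, D = q.shape
        head_dim = D // self.n_heads

        def split(t: torch.Tensor) -> torch.Tensor:
            # (B, L, D) -> (B, heads, L, head_dim)
            return t.view(B, -1, self.n_heads, head_dim).transpose(1, 2)

        out = F.scaled_dot_product_attention(split(q), split(k), split(v))
        out = out.transpose(1, 2).reshape(B, L, D)
        # Route each stream back through its own (frozen vs. trainable) output.
        n_tgt = x_tgt.shape[1]
        y_tgt = self.base.proj(out[:, :n_tgt])
        y_ref = self.expert_out(out[:, n_tgt:])
        return x_tgt + y_tgt, x_ref + y_ref
```

Because only the expert parameters receive gradients, the base model's generative prior stays intact while the expert learns to read the reference video in context, which is the plug-and-play property the abstract emphasizes.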
Yuxuan Bian
The Chinese University of Hong Kong
Machine Learning · Computer Vision · Video Generation
Xin Chen
Intelligent Creation Lab, ByteDance
Zenan Li
Intelligent Creation Lab, ByteDance
Tiancheng Zhi
Intelligent Creation Lab, ByteDance
Shen Sang
Intelligent Creation Lab, ByteDance
Linjie Luo
Research Manager at ByteDance AI Lab
Computer Graphics · Computer Vision
Qiang Xu
The Chinese University of Hong Kong