🤖 AI Summary
Existing video generation models rely on single-paragraph textual prompts, making precise temporal control over multiple events challenging and often resulting in event omissions or chronological inconsistencies. To address this, we propose MinT, the first multi-event video generation framework enabling explicit specification of start and end times for each event. Our method binds each event to a fine-grained temporal interval and introduces ReRoPE, a time-aware positional encoding, to achieve cross-modal temporal alignment between event descriptions and video tokens. Built upon a video diffusion Transformer architecture, the model is fine-tuned on temporally annotated video data. Experiments demonstrate significant improvements over state-of-the-art commercial and open-source models in event completeness, temporal accuracy, and transition naturalness. To our knowledge, this is the first approach enabling controllable temporal orchestration of multiple events within generated videos.
📝 Abstract
Real-world videos consist of sequences of events. Generating such sequences with precise temporal control is infeasible with existing video generators that rely on a single paragraph of text as input. When tasked with generating multiple events described using a single prompt, such methods often ignore some of the events or fail to arrange them in the correct order. To address this limitation, we present MinT, a multi-event video generator with temporal control. Our key insight is to bind each event to a specific period in the generated video, which allows the model to focus on one event at a time. To enable time-aware interactions between event captions and video tokens, we design a time-based positional encoding method, dubbed ReRoPE. This encoding helps to guide the cross-attention operation. By fine-tuning a pre-trained video diffusion transformer on temporally grounded data, our approach produces coherent videos with smoothly connected events. For the first time in the literature, our model offers control over the timing of events in generated videos. Extensive experiments demonstrate that MinT outperforms existing commercial and open-source models by a large margin.
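The abstract does not spell out ReRoPE's exact formulation, but the core idea of a time-based positional encoding guiding cross-attention can be sketched with a standard rotary embedding (RoPE): rotate each event caption's query by that event's time and each video token's key by its frame timestamp, so that the attention score peaks when the two times coincide. Everything below (the interval choices, the shared content vector, the function names) is an illustrative assumption, not the paper's actual implementation.

```python
import numpy as np

def rope(x, t, base=10000.0):
    """Rotate feature pairs of x by angles proportional to scalar time t (RoPE-style)."""
    half = x.shape[-1] // 2
    freqs = base ** (-np.arange(half) / half)          # per-pair rotation frequencies
    angles = np.outer(t, freqs)                        # (n_tokens, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[..., :half], x[..., half:]
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)

# Toy setup: 2 event captions bound to time intervals, 8 video frames.
# Identical content vectors everywhere, so attention depends on timing alone.
rng = np.random.default_rng(0)
d = 16
content = rng.standard_normal(d)

event_mid = np.array([1.0, 5.0])      # hypothetical midpoints of the two event intervals
frame_t = np.arange(8, dtype=float)   # one timestamp per video frame

q = rope(np.tile(content, (2, 1)), event_mid)  # caption queries, rotated by event time
k = rope(np.tile(content, (8, 1)), frame_t)    # frame keys, rotated by frame time

scores = q @ k.T / np.sqrt(d)         # cross-attention logits, shape (2, 8)
best = scores.argmax(axis=1)          # frame each caption attends to most strongly
print(best)                           # → [1 5]: each event locks onto its own interval
```

Because all rotation frequencies contribute `cos((t_q - t_k) * f)` terms with non-negative weights, the dot product is strictly maximal when the caption time equals the frame time, which is the alignment behavior a time-aware encoding is meant to induce.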