Prompt Relay: Inference-Time Temporal Control for Multi-Event Video Generation

📅 2026-04-11

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing video diffusion models struggle to control the temporal relationships among multiple events, such as their order, duration, and timing, which leads to semantic ambiguity and poor text-video alignment. This work proposes Prompt Relay, a plug-and-play inference-time method that requires no changes to the model architecture and no additional computational overhead. It introduces temporally aware prompt assignment and attention masking within the cross-attention mechanism, so that each video segment responds only to its corresponding text prompt. The approach enables training-free, temporally disentangled control of multiple events at inference time, improving temporal-semantic alignment, reducing interference between events, and enhancing both the narrative coherence and the visual quality of the generated videos.
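To make "temporally aware prompt assignment" concrete, here is a minimal sketch of one way frames could be mapped to prompts. It is an illustration under stated assumptions, not the paper's implementation: the function name `assign_frames_to_prompts`, the use of fractional `(start, end)` windows, and the frame-centre rule are all hypothetical choices made for this example.

```python
from typing import List, Tuple


def assign_frames_to_prompts(
    num_frames: int, segments: List[Tuple[float, float]]
) -> List[int]:
    """Map each frame to the index of the prompt segment that covers it.

    Assumption: each segment is a half-open window [start, end) given as a
    fraction of the video length. Frames covered by no segment get -1.
    """
    assignment = []
    for frame in range(num_frames):
        # Use the frame centre, expressed as a fraction of the whole video.
        t = (frame + 0.5) / num_frames
        index = -1
        for i, (start, end) in enumerate(segments):
            if start <= t < end:
                index = i
                break
        assignment.append(index)
    return assignment


# Three events: first quarter, middle half, last quarter of an 8-frame clip.
segments = [(0.0, 0.25), (0.25, 0.75), (0.75, 1.0)]
frame_to_prompt = assign_frames_to_prompts(8, segments)
print(frame_to_prompt)
```

In this sketch, frames 0 and 1 belong to the first prompt, frames 2 to 5 to the second, and frames 6 and 7 to the third. A real video model would typically work on latent frames and spatial tokens, so the same mapping would be repeated for every token in a frame.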

📝 Abstract

Video diffusion models have achieved remarkable progress in generating high-quality videos. However, these models struggle to represent the temporal succession of multiple events in real-world videos and lack explicit mechanisms to control when semantic concepts appear, how long they persist, and the order in which multiple events occur. Such control is especially important for movie-grade video synthesis, where coherent storytelling depends on precise timing, duration, and transitions between events. When using a single paragraph-style prompt to describe a sequence of complex events, models often exhibit semantic entanglement, where concepts intended for different moments in the video bleed into one another, resulting in poor text-video alignment. To address these limitations, we propose Prompt Relay, an inference-time, plug-and-play method to enable fine-grained temporal control in multi-event video generation, requiring no architectural modifications and no additional computational overhead. Prompt Relay introduces a penalty into the cross-attention mechanism, so that each temporal segment attends only to its assigned prompt, allowing the model to represent one semantic concept at a time and thereby improving temporal prompt alignment, reducing semantic interference, and enhancing visual quality.
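The cross-attention penalty described above can be sketched in a few lines. The abstract does not give the exact form of the penalty, so this example assumes a constant value subtracted from the attention logits whenever a video token and a text token belong to different segments. The function name `penalized_cross_attention`, the `penalty` parameter, and the plain-list representation are hypothetical and chosen only for illustration.

```python
import math
from typing import List


def penalized_cross_attention(
    scores: List[List[float]],
    query_segment: List[int],
    key_segment: List[int],
    penalty: float = 1e4,
) -> List[List[float]]:
    """Softmax over text tokens after penalizing cross-segment attention.

    scores[q][k]     : raw logit from video token q to text token k.
    query_segment[q] : segment index of the frame that video token q is in.
    key_segment[k]   : segment index of the prompt that text token k is in.
    Assumption: a constant additive penalty; the paper may use another form.
    """
    weights = []
    for q, row in enumerate(scores):
        logits = [
            s - (0.0 if key_segment[k] == query_segment[q] else penalty)
            for k, s in enumerate(row)
        ]
        peak = max(logits)  # subtract the max for numerical stability
        exps = [math.exp(value - peak) for value in logits]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights


# Two video tokens (segments 0 and 1) and four text tokens (two per prompt).
scores = [[0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0]]
attn = penalized_cross_attention(scores, [0, 1], [0, 0, 1, 1])
print(attn)
```

With a large penalty, each video token places essentially all of its attention on the text tokens of its own segment, which is the behaviour the abstract describes. Because the penalty only adjusts logits that cross-attention already computes, a sketch like this adds no extra attention passes, which is in line with the claim of no additional computational overhead.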

Problem

Research questions and friction points this paper is trying to address.

temporal control

multi-event video generation

semantic entanglement

text-video alignment

video diffusion models

Innovation

Methods, ideas, or system contributions that make the work stand out.

Prompt Relay

temporal control

video diffusion models