When and Where do Events Switch in Multi-Event Video Generation?

📅 2025-10-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the fundamental challenge of uncontrollable event switching timing and positioning in multi-event text-to-video (T2V) generation. We systematically investigate, for the first time, the spatiotemporal mechanisms by which event prompts govern generation dynamics. To this end, we introduce MEve, a self-constructed prompt benchmark designed specifically for multi-event T2V evaluation, and employ hierarchical activation analysis alongside stepwise ablation studies. Our findings reveal that event transitions are predominantly governed by the initial denoising steps and the spatiotemporal attention modules, enabling us to formulate a principled spatiotemporal control criterion for event switching. Experiments on OpenSora and CogVideoX validate the efficacy and interpretability of this mechanism. The study establishes the first explanatory foundation for multi-event T2V generation and identifies a new design direction for next-generation T2V architectures: explicit modeling of event temporal controllability.

📝 Abstract
Text-to-video (T2V) generation has surged, yet challenging questions remain, especially when a long video must depict multiple sequential events with temporal coherence and controllable content. Existing methods that extend to multi-event generation overlook the intrinsic factors that drive event shifting. This paper aims to answer a central question: when and where do multi-event prompts control event transitions during T2V generation? The work introduces MEve, a self-curated prompt suite for evaluating multi-event T2V generation, and conducts a systematic study of two representative model families, OpenSora and CogVideoX. Extensive experiments demonstrate the importance of early intervention in denoising steps and block-wise model layers, revealing the essential factors for multi-event video generation and highlighting the possibilities for multi-event conditioning in future models.
Problem

Research questions and friction points this paper is trying to address.

Determining when and where event transitions occur in multi-event video generation
Investigating how prompts control event switching in text-to-video models
Identifying essential factors for temporal coherence in multi-event videos
Innovation

Methods, ideas, or system contributions that make the work stand out.

Early intervention in denoising steps
Conditioning at block-wise model layers
Systematic evaluation with MEve prompt suite
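The "early intervention" idea can be made concrete with a toy sketch. This is not the paper's code; the loop structure, function names, and the majority-vote proxy for "content is fixed early" are all illustrative assumptions. It shows the core mechanism the paper studies: switching the conditioning prompt inside a denoising loop, where only switches that land before the early decisive window change the resulting event.

```python
# Hypothetical sketch (not the paper's implementation): multi-event prompt
# switching during a diffusion denoising loop. The paper's finding is that
# event content is largely decided in the earliest denoising steps, so a
# prompt switch only takes effect if it happens before that window.

def denoise_with_event_switch(num_steps, switch_step, prompt_a, prompt_b):
    """Run a toy denoising loop that conditions on prompt_a before
    switch_step and on prompt_b afterward. Returns the per-step
    conditioning schedule (a real model would predict noise here)."""
    schedule = []
    for t in range(num_steps):
        prompt = prompt_a if t < switch_step else prompt_b
        # ... a real T2V model would denoise conditioned on `prompt` ...
        schedule.append(prompt)
    return schedule

def dominant_event(schedule, early_window=10):
    """Toy proxy for the early-intervention observation: the generated
    event is whichever prompt conditioned the majority of the first
    `early_window` steps."""
    early = schedule[:early_window]
    return max(set(early), key=early.count)
```

Under this proxy, switching the prompt at step 30 of 50 leaves the first event dominant, while switching at step 3 lets the second event take over, mirroring the paper's claim that late prompt changes fail to trigger event transitions.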