🤖 AI Summary
This work addresses a fundamental challenge in multi-event text-to-video (T2V) generation: the timing and placement of event switches are not controllable. We systematically investigate, for the first time, the spatiotemporal mechanisms by which event prompts govern generation dynamics. To this end, we introduce MEve, a self-constructed prompt benchmark designed specifically for multi-event T2V evaluation, and employ hierarchical activation analysis alongside stepwise ablation studies. Our findings reveal that event transitions are predominantly governed by the initial denoising steps and the spatiotemporal attention modules, enabling us to formulate a principled spatiotemporal control criterion for event switching. Experiments on OpenSora and CogVideoX validate the efficacy and interpretability of this mechanism. The study establishes the first explanatory foundation for multi-event T2V generation and identifies a new design direction for next-generation T2V architectures: explicit modeling of temporal event controllability.
📝 Abstract
Text-to-video (T2V) generation has surged, yet challenging questions remain, especially when a long video must depict multiple sequential events with temporal coherence and controllable content. Existing methods that extend to multi-event generation omit an inspection of the intrinsic factors behind event shifting. This paper aims to answer a central question: when and where do multi-event prompts control event transitions during T2V generation? This work introduces MEve, a self-curated prompt suite for evaluating multi-event T2V generation, and conducts a systematic study of two representative model families, OpenSora and CogVideoX. Extensive experiments demonstrate the importance of early intervention in denoising steps and block-wise model layers, revealing the essential factors for multi-event video generation and highlighting the possibilities for multi-event conditioning in future models.
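The paper's key finding, that the initial denoising steps carry the event-switching signal, suggests a simple stepwise ablation: condition only the first few denoising steps on the multi-event prompt and fall back to a single-event prompt afterwards. Below is a minimal sketch of such an intervention under an assumed latent diffusion sampling loop; `model`, `scheduler`, and the embedding arguments are hypothetical placeholders, not actual APIs from OpenSora or CogVideoX.

```python
import torch

def sample_with_early_event_switch(
    model,                 # hypothetical denoiser: eps = model(x_t, t, text_emb)
    scheduler,             # hypothetical scheduler: .timesteps, .step() -> next latent
    emb_event_a,           # text embedding for the single-event (first event) prompt
    emb_multi_event,       # text embedding for the full multi-event prompt
    num_early_steps=5,     # how many initial steps receive the multi-event prompt
    latent_shape=(1, 4, 16, 32, 32),  # (batch, channels, frames, height, width)
):
    """Stepwise-ablation sketch: restrict multi-event conditioning to the
    first num_early_steps denoising steps. If early steps indeed govern
    event transitions, the switch should still appear in the output."""
    x = torch.randn(latent_shape)
    for i, t in enumerate(scheduler.timesteps):
        # Early steps see the multi-event prompt; later steps do not.
        cond = emb_multi_event if i < num_early_steps else emb_event_a
        eps = model(x, t, cond)
        x = scheduler.step(eps, t, x)
    return x
```

Under the paper's claim, the transition to the second event should persist even though only the first `num_early_steps` steps observe the multi-event prompt; an analogous ablation could restrict the multi-event conditioning to specific spatiotemporal attention blocks instead of timesteps.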