🤖 AI Summary
This work addresses the challenge of maintaining identity and appearance consistency for narrative entities (characters, props, and environments) across shots in video generation, particularly when an entity reappears after a prolonged absence. To this end, the authors propose an entity-centric generative framework that integrates a multi-agent system for script parsing and narrative decomposition, coupled with a dynamic memory bank that explicitly stores and updates visual and semantic representations of entities. This memory mechanism enables story-driven cross-shot consistency by retrieving relevant entity states to condition keyframe and video generation. Evaluated on a newly curated benchmark comprising 54 multi-shot narrative cases, the method demonstrates strong entity-level coherence and high perceptual quality in complex storytelling sequences.
📝 Abstract
Maintaining consistent characters, props, and environments across multiple shots is a central challenge in narrative video generation. Existing models can produce high-quality short clips but often fail to preserve entity identity and appearance when scenes change or when entities reappear after long temporal gaps. We present VideoMemory, an entity-centric framework that integrates narrative planning with visual generation through a Dynamic Memory Bank. Given a structured script, a multi-agent system decomposes the narrative into shots, retrieves entity representations from memory, and synthesizes keyframes and videos conditioned on these retrieved states. The Dynamic Memory Bank stores explicit visual and semantic descriptors for characters, props, and backgrounds, and is updated after each shot to reflect story-driven changes while preserving identity. This retrieval-update mechanism enables consistent portrayal of entities across distant shots and supports coherent long-form generation. To evaluate this setting, we construct a 54-case multi-shot consistency benchmark covering character-, prop-, and background-persistent scenarios. Extensive experiments show that VideoMemory achieves strong entity-level coherence and high perceptual quality across diverse narrative sequences.
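The retrieval-update loop described in the abstract can be pictured with the minimal Python sketch below. It is not the VideoMemory implementation: the names `EntityState`, `DynamicMemoryBank`, `build_prompt`, and `run_story`, the text-only descriptors, and the example story are all assumptions made for illustration, and the keyframe/video generation step is replaced by a placeholder string.

```python
from dataclasses import dataclass, field


@dataclass
class EntityState:
    """Hypothetical memory entry for one character, prop, or background."""
    kind: str          # "character", "prop", or "background"
    identity: str      # stable description that should persist across shots
    appearance: str    # current, story-dependent appearance
    last_shot: int = -1                                    # last shot showing the entity
    reference_frames: list = field(default_factory=list)  # stand-in for visual descriptors


class DynamicMemoryBank:
    """Toy memory bank: a dictionary from entity name to its latest state."""

    def __init__(self):
        self._entries = {}

    def register(self, name, state):
        self._entries[name] = state

    def retrieve(self, names):
        # Return the stored state of every requested entity that is known.
        return {n: self._entries[n] for n in names if n in self._entries}

    def update(self, name, shot_index, new_appearance=None, frame=None):
        # Story-driven changes overwrite appearance; identity is never touched.
        state = self._entries[name]
        if new_appearance is not None:
            state.appearance = new_appearance
        if frame is not None:
            state.reference_frames.append(frame)
        state.last_shot = shot_index


def build_prompt(shot_text, states):
    """Condition a shot description on the retrieved entity states."""
    parts = [shot_text]
    for name, s in states.items():
        parts.append(f"{name} ({s.kind}): {s.identity}; currently {s.appearance}")
    return " | ".join(parts)


def run_story(shots, bank):
    """Retrieve, 'generate' (placeholder), then update memory for each shot."""
    prompts = []
    for i, shot in enumerate(shots):
        states = bank.retrieve(shot["entities"])
        prompts.append(build_prompt(shot["text"], states))
        # A real system would synthesize keyframes and video here (assumption).
        for name in states:
            change = shot.get("changes", {}).get(name)
            bank.update(name, i, new_appearance=change, frame=f"keyframe_{i}")
    return prompts


# Illustrative story: the lantern is absent from shot 1 and reappears in shot 2.
bank = DynamicMemoryBank()
bank.register("Mira", EntityState("character", "young woman with short red hair",
                                  "wearing a blue coat"))
bank.register("lantern", EntityState("prop", "brass lantern with a cracked pane",
                                     "unlit"))
shots = [
    {"text": "Mira lights the lantern in the cellar.",
     "entities": ["Mira", "lantern"], "changes": {"lantern": "lit"}},
    {"text": "A storm rolls over the village.", "entities": []},
    {"text": "Mira walks out holding the lantern.",
     "entities": ["Mira", "lantern"]},
]
prompts = run_story(shots, bank)
for p in prompts:
    print(p)
```

In this toy run the lantern's state change in the first shot is written back to memory, so the third shot is conditioned on the updated appearance even though the lantern does not appear in the shot between them. Keeping `identity` separate from `appearance` is one simple way to let story-driven changes accumulate without overwriting what makes the entity recognizable.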