EM-Vid: Training-Free Entity-Centric Memory for Efficient and Consistent Multi-Shot Video Generation

📅 2026-05-22

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the challenge of keeping recurring entities visually consistent across the shots of a multi-shot video while staying faithful to each shot's text prompt. The authors propose a training-free, entity-centric memory that stores appearance information in a bank of latent image patches indexed by entity identity. A sparse token conditioning scheme, compatible with pretrained models, restricts self-attention to entity-relevant tokens, which reduces computational cost. The approach also includes a structured multi-shot script format, a budgeted memory update strategy that keeps the memory compact, and a noise-injection mechanism for fine-grained appearance control that limits leakage of irrelevant information. The authors report improved prompt adherence and efficiency while preserving subject consistency.
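To make the idea of an entity-indexed patch bank with a budgeted update concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not the authors' implementation: the class name `EntityMemoryBank`, the per-entity `budget_per_entity` parameter, the scalar `score` attached to each patch, and the "keep the top-scoring patches" eviction rule are all hypothetical, since the summary does not specify how the budget is enforced.

```python
from collections import defaultdict


class EntityMemoryBank:
    """Hypothetical entity-indexed bank of latent patches with a per-entity budget."""

    def __init__(self, budget_per_entity):
        if budget_per_entity < 1:
            raise ValueError("budget_per_entity must be at least 1")
        self.budget = budget_per_entity
        # Maps an entity id to a list of (score, patch) pairs.
        self._bank = defaultdict(list)

    def update(self, entity_id, patch, score):
        """Add a patch, then evict the lowest-scoring patches beyond the budget."""
        entries = self._bank[entity_id]
        entries.append((score, patch))
        # Assumed policy: keep the highest-scoring patches.
        # The paper's actual update rule may differ.
        entries.sort(key=lambda entry: entry[0], reverse=True)
        del entries[self.budget:]

    def retrieve(self, entity_ids):
        """Return stored patches only for the entities named in the current shot."""
        return {
            eid: [patch for _, patch in self._bank.get(eid, [])]
            for eid in entity_ids
        }
```

Because retrieval is keyed by entity, a shot that mentions only one entity never sees patches stored for another, which is one plausible way to read the summary's point about avoiding irrelevant information leakage.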

📝 Abstract

Multi-shot video generation requires maintaining a consistent appearance of recurring entities across shots while remaining faithful to shot-specific text prompts. Recent autoregressive methods reuse previously generated frames as memory. However, full-frame storage entangles persistent entity information with transient scene context, leading to irrelevant information leakage and high computational cost. We propose an entity-centric memory in the form of an entity-indexed bank of latent patches. We introduce sparse token conditioning compatible with pretrained models, restricting self-attention to entity-relevant tokens and reducing computational cost. To support this, we introduce a structured multi-shot script format. We additionally propose a budgeted memory update strategy to maintain a compact, evolving memory. Finally, we equip the entity representation with a noise-injection mechanism that enables fine-grained appearance control, preventing leakage of irrelevant information. Our method improves prompt adherence and efficiency while preserving subject consistency.
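The sparse conditioning and noise-injection steps described in the abstract can be sketched as follows. This is a hedged illustration only: the script layout (a list of shots with `prompt` and `entities` fields), the function names `attention_mask` and `inject_noise`, and the variance-preserving blend used for the noise are assumptions made for the example, as the abstract does not give these details.

```python
import math
import random


def attention_mask(num_video_tokens, memory_token_entities, shot_entities):
    """Build a boolean mask for video-token queries.

    Columns are the video keys followed by the memory keys. A memory key
    is visible only if its entity appears in the current shot.
    """
    allowed = set(shot_entities)
    memory_visible = [eid in allowed for eid in memory_token_entities]
    row = [True] * num_video_tokens + memory_visible
    return [list(row) for _ in range(num_video_tokens)]


def inject_noise(patch, strength, rng):
    """Blend a latent patch with Gaussian noise (assumed variance-preserving form)."""
    if not 0.0 <= strength <= 1.0:
        raise ValueError("strength must be in [0, 1]")
    keep = math.sqrt(1.0 - strength ** 2)
    return [keep * x + strength * rng.gauss(0.0, 1.0) for x in patch]


# Hypothetical structured script: each shot lists its prompt and its entities.
script = [
    {"prompt": "A woman walks her dog in a park", "entities": ["woman", "dog"]},
    {"prompt": "The woman orders coffee", "entities": ["woman"]},
]

# One entity label per memory token (two tokens for "woman", one for "dog").
memory_token_entities = ["woman", "woman", "dog"]

# In the second shot only the "woman" memory tokens are visible.
mask = attention_mask(2, memory_token_entities, script[1]["entities"])
noisy_patch = inject_noise([0.5, -0.5], strength=0.3, rng=random.Random(0))
```

A mask is used here for clarity. To actually reduce compute, hidden memory tokens would be dropped from the sequence before attention instead of being masked out. A higher `strength` weakens how tightly the stored appearance constrains the new shot, which is one way such a mechanism could offer appearance control.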

Problem

Research questions and friction points this paper is trying to address.

multi-shot video generation

entity consistency

memory efficiency

prompt adherence

video generation

Innovation

Methods, ideas, or system contributions that make the work stand out.

entity-centric memory

sparse token conditioning

multi-shot video generation