EntityBench: Towards Entity-Consistent Long-Range Multi-Shot Video Generation

📅 2026-05-14

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the challenge of keeping characters, objects, and locations consistent across shots in long-form multi-shot video generation. The authors introduce EntityBench, a long-range consistency benchmark covering all three entity types, comprising 140 episodes (2,491 shots) with explicit per-shot entity schedules organized into easy, medium, and hard tiers. Its three-pillar evaluation suite separately assesses intra-shot quality, prompt alignment, and cross-shot consistency, and uses a fidelity gate so that only accurate entity appearances count toward cross-shot scoring. As a baseline, the authors propose EntityMem, a memory-augmented system that stores verified per-entity visual references in a persistent memory bank before generation begins. Experiments show that existing methods lose consistency sharply as the gap between an entity's appearances grows, while EntityMem achieves the highest character fidelity (Cohen's d = +2.33) and presence among the methods evaluated.
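For readers unfamiliar with the effect size quoted above, the following is a minimal sketch of Cohen's d using the common pooled-standard-deviation form. The summary does not say which variant the authors used, so the pooled form is an assumption, and the function name `cohens_d` and the sample scores are hypothetical.

```python
import math
from statistics import mean, variance


def cohens_d(treatment, baseline):
    """Cohen's d with a pooled standard deviation (assumed variant).

    Both arguments are sequences of per-sample scores, e.g. fidelity
    scores of two methods. The paper's exact formulation is not given
    in the summary; this is the textbook pooled-SD version.
    """
    n1, n2 = len(treatment), len(baseline)
    # statistics.variance is the sample variance (n - 1 denominator).
    pooled_var = (
        (n1 - 1) * variance(treatment) + (n2 - 1) * variance(baseline)
    ) / (n1 + n2 - 2)
    return (mean(treatment) - mean(baseline)) / math.sqrt(pooled_var)


# Hypothetical scores, for illustration only (not data from the paper).
d = cohens_d([2, 4, 6], [1, 3, 5])
```

A positive d means the first group scores higher on average. By the usual rule of thumb, values above 0.8 are considered large, so +2.33 would indicate a very large difference in character fidelity.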

📝 Abstract

Multi-shot video generation extends single-shot generation to coherent visual narratives, yet maintaining consistent characters, objects, and locations across shots remains a challenge over long sequences. Existing evaluations typically use independently generated prompt sets with limited entity coverage and simple consistency metrics, making standardized comparison difficult. We introduce EntityBench, a benchmark of 140 episodes (2,491 shots) derived from real narrative media, with explicit per-shot entity schedules tracking characters, objects, and locations simultaneously across easy / medium / hard tiers of up to 50 shots, 13 cross-shot characters, 8 cross-shot locations, 22 cross-shot objects, and recurrence gaps spanning up to 48 shots. It is paired with a three-pillar evaluation suite that disentangles intra-shot quality, prompt-following alignment, and cross-shot consistency, with a fidelity gate that admits only accurate entity appearances into cross-shot scoring. As a baseline, we propose EntityMem, a memory-augmented generation system that stores verified per-entity visual references in a persistent memory bank before generation begins. Experiments show that cross-shot entity consistency degrades sharply with recurrence distance in existing methods, and that explicit per-entity memory yields the highest character fidelity (Cohen's d = +2.33) and presence among methods evaluated. Code and data are available at https://github.com/Catherine-R-He/EntityBench/.
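To make two ideas from the abstract concrete, here is a small sketch of how recurrence gaps could be derived from a per-shot entity schedule, and how a fidelity gate could filter appearances before cross-shot scoring. Everything in it is an assumption for illustration: the schedule format, the gap definition (difference in shot indices), the 0.5 threshold, the use of cosine similarity, and the names `recurrence_gaps` and `gated_consistency` do not come from the paper or its repository.

```python
import math
from collections import defaultdict
from itertools import combinations


def recurrence_gaps(schedule):
    """Map each entity to the gaps between its consecutive appearances.

    `schedule` is a list with one set of entity ids per shot
    (hypothetical format). A gap is the difference in shot indices,
    which is an assumed definition.
    """
    last_seen = {}
    gaps = defaultdict(list)
    for shot_idx, entities in enumerate(schedule):
        for entity in entities:
            if entity in last_seen:
                gaps[entity].append(shot_idx - last_seen[entity])
            last_seen[entity] = shot_idx
    return dict(gaps)


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def gated_consistency(appearances, threshold=0.5):
    """Mean pairwise similarity over appearances that pass the gate.

    `appearances` is a list of (fidelity_score, embedding) pairs for one
    entity. Appearances below `threshold` are excluded, so an inaccurate
    rendering cannot raise or lower the cross-shot score. Returns None
    when fewer than two appearances are admitted.
    """
    admitted = [emb for fidelity, emb in appearances if fidelity >= threshold]
    if len(admitted) < 2:
        return None
    pairs = list(combinations(admitted, 2))
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)


# Hypothetical four-shot schedule and appearance data.
schedule = [{"alice", "cafe"}, {"bob"}, {"alice"}, {"cafe", "bob"}]
gaps = recurrence_gaps(schedule)

appearances = [(0.9, [1.0, 0.0]), (0.2, [0.0, 1.0]), (0.8, [1.0, 0.0])]
score = gated_consistency(appearances)
```

In this toy example the second appearance fails the gate, so only the two matching embeddings are compared. Keeping the gate separate from the similarity measure mirrors the abstract's point that fidelity and cross-shot consistency are scored as distinct concerns.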

Problem

Research questions and friction points this paper is trying to address.

multi-shot video generation

entity consistency

long-range video generation

cross-shot consistency

visual narrative

Innovation

Methods, ideas, or system contributions that make the work stand out.

EntityBench

multi-shot video generation

entity consistency