AI Summary
This work addresses the challenge of generating temporally coherent multi-character image sequences that preserve character identity while keeping the narrative dynamic. Existing methods often mishandle this trade-off, either distorting character appearances or letting the plot stagnate. To overcome this, the authors propose RealDiffusion, a unified framework that uses heat diffusion as a dissipative prior to stabilize character features and a region-aware stochastic process to drive pose and scene evolution. A key innovation is a training-free physics-informed attention mechanism that models feature dynamics as a configurable physical system, so that controllable physical priors can be injected during inference to balance spatiotemporal consistency against prompt-driven variation. Experiments demonstrate that the proposed method outperforms current approaches in both character consistency and narrative expressiveness.
Abstract
While modern diffusion models excel at generating diverse single images, extending this to sequential generation reveals a fundamental challenge: balancing narrative dynamism with multi-character coherence. Existing methods often falter at this trade-off, leading to artifacts where characters lose their identity or the story stagnates. To resolve this critical tension, we introduce RealDiffusion, a unified framework designed to reconcile robust coherence with narrative dynamism. Heat diffusion serves as a dissipative prior that averages neighboring features along the sequence and removes high-frequency noise within the subject region. This suppresses attribute drift and stabilizes identity across frames. A region-aware stochastic process then introduces small perturbations that explore nearby modes and prevent collapse so the story maintains pose change and scene evolution. We thus introduce a lightweight, training-free Physics-informed Attention mechanism that injects controllable physical priors into the self-attention layers during inference. By modeling feature evolution as a configurable physical system, our method regularizes spatio-temporal relationships without suppressing intentional, prompt-driven changes. Extensive experiments demonstrate that RealDiffusion achieves substantial gains in character coherence while preserving narrative dynamism, outperforming state-of-the-art approaches. Code is available at https://github.com/ShmilyQi-CN/RealDiffusion.
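The two priors described in the abstract can be illustrated with a toy NumPy sketch. This is not the RealDiffusion implementation: the function names `heat_step` and `region_noise`, the `(frames, tokens, channels)` feature layout, the boolean subject mask, and the parameters `alpha`, `sigma_in`, and `sigma_out` are all assumptions made for illustration, and how the actual method couples these priors to self-attention is not specified here. The sketch only shows the general idea: a discrete heat-diffusion step averages neighboring features along the sequence inside the subject region, and a region-dependent Gaussian perturbation keeps the remaining features from collapsing.

```python
import numpy as np


def heat_step(feats, mask, alpha=0.2):
    """One explicit Euler step of 1-D heat diffusion along the frame axis.

    feats: float array of shape (frames, tokens, channels) (assumed layout).
    mask:  bool array of shape (frames, tokens), True inside the subject region.
    alpha: diffusion strength; the explicit scheme is stable for alpha <= 0.5.
    """
    # Replicate the first and last frames so no "heat" flows out of the sequence.
    padded = np.concatenate([feats[:1], feats, feats[-1:]], axis=0)
    # Discrete Laplacian over neighboring frames: x[t-1] - 2*x[t] + x[t+1].
    laplacian = padded[:-2] - 2.0 * feats + padded[2:]
    # Smooth only the tokens flagged as belonging to the subject.
    return feats + alpha * mask[..., None] * laplacian


def region_noise(feats, mask, sigma_in=0.01, sigma_out=0.05, rng=None):
    """Add Gaussian perturbations whose scale depends on the region (hypothetical)."""
    rng = np.random.default_rng() if rng is None else rng
    # Smaller noise inside the subject region, larger noise elsewhere.
    sigma = np.where(mask, sigma_in, sigma_out)[..., None]
    return feats + sigma * rng.standard_normal(feats.shape)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    feats = rng.standard_normal((6, 4, 3))   # 6 frames, 4 tokens, 3 channels
    mask = np.zeros((6, 4), dtype=bool)
    mask[:, :2] = True                        # first two tokens form the subject
    smoothed = heat_step(feats, mask)
    perturbed = region_noise(smoothed, mask, rng=rng)
    print(perturbed.shape)
```

In this toy setting, the diffusion step reduces frame-to-frame variance of the masked tokens while leaving their mean over frames unchanged, which loosely mirrors the "suppress attribute drift, keep identity" role the abstract assigns to the dissipative prior. Assigning the larger noise scale to the non-subject region is a design guess, not something the abstract states.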