🤖 AI Summary
This work addresses cross-frame semantic interference in multi-frame visual story generation, where existing training-free methods fuse identity and frame prompts into a single entangled representation and therefore struggle to balance character identity consistency against per-frame semantic specificity. The authors propose ReDiStory, a training-free, inference-stage approach that, for the first time within a diffusion model framework, achieves region-wise disentanglement and inter-frame decorrelation of text prompt embeddings. It decomposes the text embeddings into identity-related and frame-specific components and then suppresses the directions shared across frames, which reduces semantic interference between frames. The method requires no changes to model parameters and no additional supervision. With the same diffusion backbone and inference settings, it shows consistent gains over the 1Prompt1Story baseline on multiple identity consistency metrics of the ConsiStory+ benchmark while maintaining prompt fidelity.
📝 Abstract
Generating coherent visual stories requires maintaining subject identity across multiple images while preserving frame-specific semantics. Recent training-free methods concatenate identity and frame prompts into a unified representation, but this often introduces inter-frame semantic interference that weakens identity preservation in complex stories. We propose ReDiStory, a training-free framework that improves multi-frame story generation via inference-time prompt embedding reorganization. ReDiStory explicitly decomposes text embeddings into identity-related and frame-specific components, then decorrelates frame embeddings by suppressing shared directions across frames. This reduces cross-frame interference without modifying diffusion parameters or requiring additional supervision. Under identical diffusion backbones and inference settings, ReDiStory improves identity consistency while maintaining prompt fidelity. Experiments on the ConsiStory+ benchmark show consistent gains over 1Prompt1Story on multiple identity consistency metrics. Code is available at: https://github.com/YuZhenyuLindy/ReDiStory
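The two steps described in the abstract, splitting a concatenated prompt embedding into identity and frame parts and then suppressing directions shared across frames, can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function names, the token-index split, the mean pooling, and the choice of the mean frame embedding as the single "shared direction" are all assumptions made for illustration, and the actual rule used by ReDiStory may differ (see the linked repository).

```python
import numpy as np


def split_identity_and_frames(token_embs, identity_len, frame_lens):
    """Split a concatenated prompt embedding of shape (T, D) by token index.

    Hypothetical layout: the first `identity_len` tokens describe the subject,
    followed by one contiguous token span per frame prompt.
    """
    token_embs = np.asarray(token_embs, dtype=np.float64)
    identity = token_embs[:identity_len]
    frame_tokens = token_embs[identity_len:identity_len + sum(frame_lens)]
    frames = np.split(frame_tokens, np.cumsum(frame_lens)[:-1])
    return identity, frames


def decorrelate_frames(frame_embs, strength=1.0):
    """Remove the component each frame embedding shares with the others.

    Assumption: the shared direction is the normalized mean of the frame
    embeddings. `strength` in [0, 1] scales how much of it is removed.
    """
    frame_embs = np.asarray(frame_embs, dtype=np.float64)
    shared = frame_embs.mean(axis=0)
    norm = np.linalg.norm(shared)
    if norm < 1e-12:  # no shared direction to suppress
        return frame_embs.copy()
    u = shared / norm
    coeffs = frame_embs @ u  # projection of each frame onto u
    return frame_embs - strength * np.outer(coeffs, u)


def mean_pairwise_cosine(embs):
    """Average cosine similarity over all distinct pairs of rows."""
    normed = embs / np.linalg.norm(embs, axis=1, keepdims=True)
    sims = normed @ normed.T
    rows, cols = np.triu_indices(len(embs), k=1)
    return float(sims[rows, cols].mean())


# Toy data: 3 identity tokens, then 3 frame prompts of 2 tokens each.
# The +3.0 offset plays the role of semantics shared by every frame.
rng = np.random.default_rng(0)
token_embs = rng.normal(size=(9, 8)) + 3.0
identity, frames = split_identity_and_frames(token_embs, 3, [2, 2, 2])
pooled = np.stack([f.mean(axis=0) for f in frames])  # one vector per frame
decorrelated = decorrelate_frames(pooled)

print("before:", mean_pairwise_cosine(pooled))
print("after: ", mean_pairwise_cosine(decorrelated))
```

In this toy setup the identity tokens are left untouched and only the frame embeddings are modified, mirroring the abstract's claim that the change happens purely at the embedding level, with no modification of diffusion parameters. Setting `strength` below 1.0 would keep part of the shared component, a knob introduced here only for illustration.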