🤖 AI Summary
This work addresses character identity consistency across frames in human-centric image story generation, in particular high-fidelity facial preservation and cross-frame identity coherence. We propose an identity-aware fine-tuning framework built upon diffusion models, centered on two key innovations: iterative identity discovery and re-denoising-based identity injection. Our approach leverages CLIP-guided cross-frame identity alignment and iterative latent-space clustering to achieve precise, semantics-preserving identity control. To our knowledge, this is the first method to systematically resolve long-sequence, multi-character identity consistency. Evaluated on the ConsiStory-Human benchmark, it achieves a 23.6% improvement in ID-Retrieval accuracy, supports arbitrarily long story generation, enables real-time character composition, and attains a 91.2% success rate in multi-character scenes.
📝 Abstract
Recent visual generative models enable story generation with consistent characters from text, but human-centric story generation faces additional challenges, such as maintaining detailed and diverse human face consistency and coordinating multiple characters across different images. This paper presents IdentityStory, a framework for human-centric story generation that ensures consistent character identity across multiple sequential images. By taming identity-preserving generators, the framework features two key components: Iterative Identity Discovery, which extracts cohesive character identities, and Re-denoising Identity Injection, which re-denoises images to inject identities while preserving desired context. Experiments on the ConsiStory-Human benchmark demonstrate that IdentityStory outperforms existing methods, particularly in face consistency, and supports multi-character combinations. The framework also shows strong potential for applications such as infinite-length story generation and dynamic character composition.