Infinite-Story: A Training-Free Consistent Text-to-Image Generation

📅 2025-11-17
📈 Citations: 0
✨ Influential: 0

🤖 AI Summary
To address identity and style inconsistency in text-to-image generation across multiple prompts, this paper proposes Infinite-Story, a training-free framework that enforces consistency at inference time. Built on a scale-wise autoregressive model, the method introduces Identity Prompt Replacement to mitigate context bias in the text encoder, together with a unified attention guidance mechanism, comprising Adaptive Style Injection and Synchronized Guidance Adaptation, that enforces global style and identity appearance consistency across prompts. All operations run at inference time and require no model fine-tuning. Experiments show state-of-the-art identity and style consistency while preserving prompt fidelity, at 1.72 seconds per image, which is over 6X faster than the fastest existing consistent T2I models.

📝 Abstract
We present Infinite-Story, a training-free framework for consistent text-to-image (T2I) generation tailored for multi-prompt storytelling scenarios. Built upon a scale-wise autoregressive model, our method addresses two key challenges in consistent T2I generation: identity inconsistency and style inconsistency. To overcome these issues, we introduce three complementary techniques: Identity Prompt Replacement, which mitigates context bias in text encoders to align identity attributes across prompts; and a unified attention guidance mechanism comprising Adaptive Style Injection and Synchronized Guidance Adaptation, which jointly enforce global style and identity appearance consistency while preserving prompt fidelity. Unlike prior diffusion-based approaches that require fine-tuning or suffer from slow inference, Infinite-Story operates entirely at test time, delivering high identity and style consistency across diverse prompts. Extensive experiments demonstrate that our method achieves state-of-the-art generation performance, while offering over 6X faster inference (1.72 seconds per image) than the existing fastest consistent T2I models, highlighting its effectiveness and practicality for real-world visual storytelling.
Problem

Research questions and friction points this paper is trying to address.

Addresses identity inconsistency in multi-prompt text-to-image storytelling
Addresses style inconsistency across prompts in visual generation
Eliminates the need for fine-tuning while maintaining prompt fidelity
Innovation

Methods, ideas, or system contributions that make the work stand out.

Training-free framework for consistent text-to-image generation
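Identity Prompt Replacement mitigates context bias in text encoders to align identity attributes across prompts
Unified attention guidance (Adaptive Style Injection and Synchronized Guidance Adaptation) enforces global style and identity appearance consistency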
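
The abstract does not spell out how Identity Prompt Replacement is implemented, so the following is only a rough sketch under a stated assumption: that the idea can be approximated by overwriting each prompt's identity-token embeddings with those of one reference prompt, so every prompt shares the same identity representation. The function name `replace_identity_tokens`, the `identity_spans` argument, and the toy embeddings are all hypothetical and are not taken from the paper or its code.

```python
from typing import List, Sequence

# Hypothetical representation: one row of floats per token.
Embedding = List[List[float]]


def replace_identity_tokens(
    prompt_embeddings: Sequence[Embedding],
    identity_spans: Sequence[range],
    reference_index: int = 0,
) -> List[Embedding]:
    """Copy the reference prompt's identity-token rows into every prompt.

    Assumption (not from the paper): each prompt's identity phrase occupies
    a known token span, and all spans have the same length.
    """
    ref_span = identity_spans[reference_index]
    ref_rows = [list(prompt_embeddings[reference_index][i]) for i in ref_span]

    aligned: List[Embedding] = []
    for emb, span in zip(prompt_embeddings, identity_spans):
        if len(span) != len(ref_rows):
            raise ValueError("identity spans must cover the same number of tokens")
        new_emb = [list(row) for row in emb]  # copy so inputs stay untouched
        for offset, token_idx in enumerate(span):
            new_emb[token_idx] = list(ref_rows[offset])
        aligned.append(new_emb)
    return aligned


# Toy example: two prompts, three tokens each, identity phrase at token 0.
emb_a = [[1.0, 0.0], [0.2, 0.3], [0.5, 0.5]]
emb_b = [[0.9, 0.1], [0.7, 0.7], [0.1, 0.4]]
aligned = replace_identity_tokens([emb_a, emb_b], [range(0, 1), range(0, 1)])
```

In this toy run, the second prompt's identity row becomes a copy of the first prompt's, while its remaining context tokens are left unchanged. The actual method operates on real text-encoder outputs and may involve additional steps that the abstract does not describe.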