🤖 AI Summary
This work addresses two challenges in consistent text-to-image generation: poor inter-frame consistency among multiple characters and subject identity confusion (i.e., attention leakage). Methodologically, we propose a training-free solution featuring a bounded cross-frame self-attention mechanism to suppress spurious inter-frame interference, coupled with a token-merging layer for fine-grained semantic alignment. We further introduce a multimodal chain-of-thought reasoning framework integrated with region pre-localization, enabling long-term, stable modeling of characters and objects. Without introducing additional parameters or requiring fine-tuning, our approach achieves state-of-the-art performance across the key dimensions of multi-character consistency, identity discriminability, and detail fidelity. Extensive evaluations on multiple visual storytelling benchmarks demonstrate significant qualitative and quantitative improvements over existing methods.
📝 Abstract
Training-free consistent text-to-image generation, which depicts the same subjects across different images, is a topic of widespread recent interest. Existing works in this direction predominantly rely on cross-frame self-attention, which improves subject consistency by allowing tokens in each frame to attend to tokens in other frames during self-attention computation. While useful for single subjects, this mechanism struggles when scaling to multiple characters. In this work, we first analyze the reason for these limitations. Our exploration reveals that the primary issue stems from self-attention leakage, which is exacerbated when trying to ensure consistency across multiple characters. This happens when tokens from one subject attend to tokens from other characters, causing subjects to resemble each other (e.g., a dog appearing like a duck). Motivated by these findings, we propose StoryBooth: a training-free approach for improving multi-character consistency. In particular, we first leverage multi-modal chain-of-thought reasoning and region-based generation to localize the different subjects a priori across the desired story outputs. The final outputs are then generated using a modified diffusion model which consists of two novel layers: 1) a bounded cross-frame self-attention layer for reducing inter-character attention leakage, and 2) a token-merging layer for improving consistency of fine-grained subject details. Through both qualitative and quantitative results, we find that the proposed approach surpasses the prior state of the art, exhibiting improved consistency across both multiple characters and fine-grained subject details.
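The bounded cross-frame self-attention idea can be illustrated with a minimal sketch. Assuming each token has already been assigned to a subject region (a hypothetical stand-in for the region pre-localization step; the paper's actual masking rule may differ), the mask below lets tokens attend freely within their own frame but restricts cross-frame attention to tokens of the same subject, suppressing inter-character leakage:

```python
import numpy as np

def bounded_cross_frame_mask(subject_ids):
    """Boolean attention mask for bounded cross-frame self-attention.

    subject_ids: int array of shape (num_frames, num_tokens); each entry
    labels the subject region a token belongs to (e.g. 0 = background,
    1..K = characters). This labeling is an assumed input, standing in
    for the region pre-localization described in the abstract.

    Returns a (F*T, F*T) boolean mask where True means attention is
    allowed: tokens attend freely within their own frame, but across
    frames only to tokens of the same subject.
    """
    F, T = subject_ids.shape
    flat = subject_ids.reshape(-1)                 # subject label per token
    frame = np.repeat(np.arange(F), T)             # frame index per token
    same_frame = frame[:, None] == frame[None, :]  # intra-frame attention kept
    same_subject = flat[:, None] == flat[None, :]  # cross-frame: same subject only
    return same_frame | same_subject

# Two frames, three tokens each: subjects 1 and 2 plus background (0).
ids = np.array([[1, 2, 0],
                [2, 1, 0]])
mask = bounded_cross_frame_mask(ids)
# Token 0 (frame 0, subject 1) may attend to token 4 (frame 1, subject 1)
# but not to token 3 (frame 1, subject 2).
```

In practice such a mask would be added (as `-inf` at blocked positions) to the attention logits of each cross-frame self-attention layer before the softmax.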