TaleDiffusion: Multi-Character Story Generation with Dialogue Rendering

📅 2025-09-04
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing text-to-story visualization methods struggle to maintain multi-character appearance and dialogue consistency across frames, leading to visual artifacts, dialogue misalignment, and narrative discontinuity. To address this, we propose an iterative multi-frame co-generation framework. Our key contributions are: (1) boundary-aware attention-driven per-frame masking for fine-grained regional control; (2) an identity-consistent self-attention mechanism that jointly integrates contextual learning and region-aware cross-attention to tightly bind character representations with speech bubbles; and (3) a unified pipeline combining large language model–generated frame descriptions and dialogue, CLIPSeg-based dialogue region localization, diffusion-based image synthesis, and post-processing refinement. Experiments demonstrate that our method significantly outperforms state-of-the-art approaches in character consistency, noise suppression, and dialogue visualization fidelity, thereby enhancing narrative coherence and visual plausibility in multi-character storytelling.

📝 Abstract
Text-to-story visualization is challenging because it requires consistent interaction among multiple characters across frames. Existing methods struggle with character consistency, producing artifacts and inaccurate dialogue rendering that result in disjointed storytelling. In response, we introduce TaleDiffusion, a novel framework that generates multi-character stories through an iterative process, maintaining character consistency and assigning dialogue accurately via post-processing. Given a story, we use a pre-trained LLM to generate per-frame descriptions, character details, and dialogues via in-context learning, followed by a bounded attention-based per-box mask technique to control character interactions and minimize artifacts. We then apply an identity-consistent self-attention mechanism to ensure character consistency across frames and region-aware cross-attention for precise object placement. Dialogues are rendered as speech bubbles and assigned to characters via CLIPSeg. Experimental results demonstrate that TaleDiffusion outperforms existing methods in consistency, noise reduction, and dialogue rendering.
Problem

Research questions and friction points this paper is trying to address.

Ensuring consistent multi-character interaction across frames
Reducing artifact generation and inaccurate dialogue rendering
Maintaining character consistency and precise object placement
Innovation

Methods, ideas, or system contributions that make the work stand out.

Iterative multi-character generation with postprocessing
Bounded attention-based mask for artifact control
Identity-consistent self-attention for cross-frame consistency
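The abstract describes a four-stage pipeline: LLM story planning, masked diffusion synthesis, identity-consistent cross-frame attention, and CLIPSeg-based dialogue assignment. A minimal sketch of that flow is given below; all class and function names are illustrative assumptions, not the authors' actual API, and the heavy stages (diffusion, CLIPSeg) are stubbed out.

```python
from dataclasses import dataclass

@dataclass
class Frame:
    description: str           # per-frame scene description from the LLM
    dialogues: dict            # speaker -> line, also produced by the LLM
    image: object = None       # filled in by the diffusion stage

def plan_story(story: str) -> list:
    """Stage 1 (assumed): an LLM expands the story into per-frame
    descriptions, character details, and dialogue via in-context
    learning. Stubbed here with a naive sentence split."""
    return [Frame(description=s.strip(), dialogues={})
            for s in story.split(".") if s.strip()]

def synthesize(frames: list) -> list:
    """Stages 2-3 (assumed): diffusion synthesis with bounded per-box
    attention masks for artifact control and identity-consistent
    self-attention shared across frames. Stubbed: tag each frame with
    a placeholder image handle."""
    for i, f in enumerate(frames):
        f.image = f"image_{i}"
    return frames

def render_dialogue(frames: list) -> list:
    """Stage 4 (assumed): CLIPSeg localizes each speaker's region and
    the speech bubble is rendered there during post-processing.
    Stubbed: append the bubble text to the frame description."""
    for f in frames:
        for speaker, line in f.dialogues.items():
            f.description += f' [{speaker}: "{line}"]'
    return frames

story = "A fox meets a crow. The crow sings. The fox grabs the cheese"
frames = render_dialogue(synthesize(plan_story(story)))
print(len(frames), frames[0].image)
```

The sketch only fixes the stage boundaries and data flow; in the paper each stub corresponds to a learned component, and the attention-level mechanisms operate inside the diffusion model rather than as separate post-hoc steps.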
👥 Authors
Ayan Banerjee (Computer Vision Center, Universitat Autònoma de Barcelona)
Josep Lladós (Computer Vision Center, Universitat Autònoma de Barcelona; Computer Vision, Pattern Recognition, Document Analysis)
Umapada Pal (Indian Statistical Institute, Kolkata)
Anjan Dutta (Institute for People Centred Artificial Intelligence, University of Surrey)