🤖 AI Summary
This work addresses long-range inconsistency in multi-frame story illustration, where character identity, layout, and emotional expression often drift across frames. The authors propose S2ED (Story-to-Executable Descriptions), a training-free, model-agnostic prompt-layer framework that coordinates three agents to convert a full narrative into a sequence of explicit, editable executable descriptions: the agents segment the narrative, ground canonical character attributes, and enrich spatial and affective cues. Because state is carried in the prompts themselves, it propagates across frames in an interpretable way, and inter-frame drift can be repaired with local edits rather than by retraining the generator. On the Flintstones and Shakoo Maku datasets, S2ED improves sequence-level consistency and character fidelity over strong prompting baselines, large-model planners, and a reference training-based method, under both automatic metrics and human judgments. It has also been deployed in an end-to-end story-to-storybook system for children's illustrated stories.
📝 Abstract
Multi-frame story illustration requires long-horizon coherence beyond single-image text-to-image generation, including narrative decomposition and persistent character identity, layout, and affect across frames. We propose Story-to-Executable Descriptions (S2ED), a training-free, model-agnostic, prompt-layer framework that converts a full story into a sequence of explicit, editable executable descriptions for more consistent rendering. S2ED coordinates three agents to segment the narrative, ground canonical character attributes, and enrich spatial and affective cues, enabling interpretable prompt-carried state propagation and local edits to repair drift without retraining the generator. Experiments on Flintstones and Shakoo Maku show that S2ED improves sequence-level consistency and character fidelity over strong prompting, large-model planning, and a reference training-based method, under both automatic metrics and human judgments. We also deploy S2ED in an end-to-end story-to-storybook system for children's illustrated stories, with a supplementary video.
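As a rough illustration of what a sequence of "executable descriptions" could look like, the sketch below chains three stand-in stages for segmentation, character grounding, and cue enrichment. It is a minimal sketch under stated assumptions, not the authors' implementation: the names `FrameDescription`, `segment_story`, `ground_characters`, `enrich_cues`, and `build_descriptions` are hypothetical, the prompt format is invented, and each agent is replaced by a simple rule (sentence splitting and dictionary lookups) where the paper's agents would presumably rely on a language model.

```python
from dataclasses import dataclass, field


@dataclass
class FrameDescription:
    """Hypothetical per-frame description; all field names are assumptions."""
    index: int
    text: str                                        # narrative segment for this frame
    characters: dict = field(default_factory=dict)   # name -> canonical attributes
    layout: str = ""                                 # spatial cue
    affect: str = ""                                 # emotional cue

    def to_prompt(self) -> str:
        # Flatten the explicit state into one prompt string for any generator.
        cast = "; ".join(f"{n}: {a}" for n, a in sorted(self.characters.items()))
        parts = [self.text, cast, self.layout, self.affect]
        return " | ".join(p for p in parts if p)


def segment_story(story: str) -> list:
    # Stand-in for the segmentation agent: one frame per sentence.
    sentences = [s.strip() for s in story.split(".") if s.strip()]
    return [FrameDescription(i, s) for i, s in enumerate(sentences)]


def ground_characters(frames: list, character_sheet: dict) -> list:
    # Stand-in for the grounding agent: attach the same canonical attributes
    # to every frame that mentions a character, so identity is restated
    # identically in each prompt.
    for frame in frames:
        frame.characters = {
            name: attrs for name, attrs in character_sheet.items()
            if name in frame.text
        }
    return frames


def enrich_cues(frames: list, cues: dict) -> list:
    # Stand-in for the enrichment agent: look up (layout, affect) per frame.
    for frame in frames:
        frame.layout, frame.affect = cues.get(frame.index, ("", ""))
    return frames


def build_descriptions(story: str, character_sheet: dict, cues: dict) -> list:
    # Chain the three stages into an editable list of frame descriptions.
    frames = segment_story(story)
    frames = ground_characters(frames, character_sheet)
    return enrich_cues(frames, cues)
```

Because each frame's state is an explicit record, a drifting frame can in principle be repaired by editing one field (for example, setting `frames[1].affect`) and regenerating only that frame's prompt, which is the kind of local edit the abstract describes.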