BOOKAGENT: Orchestrating Safety-Aware Visual Narratives via Multi-Agent Cognitive Calibration

📅 2026-04-16
📈 Citations: 0
Influential: 0

🤖 AI Summary
This work addresses two limitations of existing multimodal approaches to illustrated children’s story generation: they often lack holistic cross-modal alignment, and they neglect child-oriented safety constraints. To overcome these challenges, we propose BookAgent, a safety-aware multi-agent collaborative framework that generates complete storybooks end-to-end from user-provided drafts. BookAgent jointly orchestrates plot planning, textual narration, and illustration generation, and includes a mechanism for resolving global inconsistencies. Child safety constraints are embedded in both narrative planning and sequence-level multimodal verification, and dynamic cognitive calibration is used to achieve page-level text-image alignment and cross-page temporal coherence. Experimental results show that BookAgent significantly outperforms current methods in narrative coherence, visual consistency, and safety compliance, establishing a robust paradigm for complex multimodal creative generation.

📝 Abstract
Recent advances in Large Generative Models (LGMs) have revolutionized multi-modal generation. However, generating illustrated storybooks remains an open challenge: prior works mainly decompose the task into separate stages, so holistic multi-modal grounding remains limited. Moreover, while safety alignment has been studied for text-only and image-only generation, existing works rarely integrate child-specific safety constraints into narrative planning and sequence-level multi-modal verification. To address these limitations, we propose BookAgent, a multi-agent collaboration framework for high-quality, safety-aware visual narratives. Unlike prior story visualization models that assume a fixed storyline sequence, BookAgent targets end-to-end storybook synthesis from a user draft by jointly planning, scripting, illustrating, and globally repairing inconsistencies. To ensure precise multi-modal grounding, BookAgent dynamically calibrates page-level alignment between textual scripts and visual layouts. BookAgent also calibrates holistic consistency along the temporal dimension by verifying and then rectifying global inconsistencies in character identity and storytelling logic. Extensive experiments demonstrate that BookAgent significantly outperforms current methods in narrative coherence, visual consistency, and safety compliance, offering a robust paradigm for reliable agents in complex multi-modal creation. The implementation will be publicly released at https://github.com/bogao-code/BookAgent/tree/main.
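
The verify-then-rectify idea in the abstract can be illustrated with a small control loop. The sketch below is not BookAgent's implementation: the `Page` type, the two toy verifiers, the blocklist, and `toy_rectify` are all hypothetical stand-ins chosen only to show how sequence-level checks (safety, character identity) might feed a global repair step until no issues remain.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Page:
    text: str          # narration for this page
    image_prompt: str  # stand-in for a generated illustration (assumption)


# Hypothetical interfaces: a verifier reports issues, a rectifier repairs them.
Verifier = Callable[[List[Page]], List[str]]
Rectifier = Callable[[List[Page], List[str]], List[Page]]


def verify_then_rectify(
    pages: List[Page],
    verifiers: List[Verifier],
    rectify: Rectifier,
    max_rounds: int = 3,
) -> Tuple[List[Page], int]:
    """Run global checks and repairs until clean or the round budget is spent."""
    for round_idx in range(max_rounds):
        issues = [msg for verify in verifiers for msg in verify(pages)]
        if not issues:
            return pages, round_idx
        pages = rectify(pages, issues)
    return pages, max_rounds


BANNED = {"knife"}  # toy blocklist, purely illustrative


def safety_verifier(pages: List[Page]) -> List[str]:
    """Flag pages whose text contains a blocked word."""
    return [f"unsafe:{i}" for i, p in enumerate(pages)
            if any(word in p.text for word in BANNED)]


def make_identity_verifier(name: str) -> Verifier:
    """Flag pages whose illustration prompt never names the protagonist."""
    def check(pages: List[Page]) -> List[str]:
        return [f"identity:{i}" for i, p in enumerate(pages)
                if name not in p.image_prompt]
    return check


def toy_rectify(pages: List[Page], issues: List[str], name: str = "Mia") -> List[Page]:
    """Apply a trivial fix per issue; a real system would call generative agents."""
    fixed = list(pages)
    for issue in issues:
        kind, idx = issue.split(":")
        i = int(idx)
        page = fixed[i]
        if kind == "unsafe":
            text = page.text
            for word in BANNED:
                text = text.replace(word, "spoon")
            fixed[i] = Page(text, page.image_prompt)
        elif kind == "identity":
            fixed[i] = Page(page.text, f"{name}, {page.image_prompt}")
    return fixed


draft = [
    Page("Mia finds a knife.", "a girl in a kitchen"),
    Page("Mia bakes bread.", "Mia, smiling, holding bread"),
]
book, rounds = verify_then_rectify(
    draft, [safety_verifier, make_identity_verifier("Mia")], toy_rectify
)
```

In this toy run the first page triggers both checks, one repair round fixes it, and the second pass finds nothing to report. The bounded `max_rounds` is a design assumption to keep the loop from running indefinitely when a repair cannot satisfy a verifier.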

Problem

Research questions and friction points this paper is trying to address.

visual narratives
multi-modal grounding
safety alignment
storybook generation
child-specific safety

Innovation

Methods, ideas, or system contributions that make the work stand out.

multi-agent collaboration
safety-aware generation
multi-modal grounding
visual narrative
cognitive calibration