🤖 AI Summary
This work addresses the challenge of enhancing human-AI collaborative visual storytelling through embodied, toy-like interaction. The Toyteller system unifies direct manipulation of anthropomorphized character symbols (toy-playing) with AI-driven narrative generation and motion synthesis by mapping both onto a shared semantic space, enabling joint control by large language models (e.g., GPT-4o) and motion generation models. Methodologically, it integrates semantic space mapping, cross-modal translation, and HCI-informed interactive prototyping. Technical evaluation shows that Toyteller outperforms a competitive GPT-4o baseline on motion-steered text generation. A user study confirms that toy-playing manipulation effectively conveys creative intent that is hard to verbalize, though motion alone could not express all user intentions, suggesting it be combined with other modalities such as language. The core contribution is an embodied AI narrative framework supporting rich, intention-driven, tightly coupled multimodal interaction, bridging direct manipulation, linguistic description, and procedural motion generation in a unified architecture.
📝 Abstract
We introduce Toyteller, an AI-powered storytelling system in which users generate a mix of story text and visuals by directly manipulating character symbols, as if playing with toys. Anthropomorphized symbol motions can convey rich and nuanced social interactions; Toyteller leverages these motions (1) to let users steer story text generation and (2) as a visual output format that accompanies story text. We enabled motion-steered text generation and text-steered motion generation by mapping motions and text onto a shared semantic space so that large language models and motion generation models can use it as a translational layer. Technical evaluations showed that Toyteller outperforms a competitive baseline, GPT-4o. Our user study identified that toy-playing helps express intentions that are difficult to verbalize. However, motion alone could not express all user intentions, suggesting that it be combined with other modalities such as language. We discuss the design space of toy-playing interactions and implications for technical HCI research on human-AI interaction.
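To make the "shared semantic space as a translational layer" idea concrete, below is a minimal sketch, not the paper's implementation. The anchor descriptors, the `encode_motion` heuristic, and the motion parameters are hypothetical placeholders; a real system would rely on learned text and motion encoders and on the paper's own semantic space.

```python
# Minimal sketch (hypothetical, not the authors' implementation): a shared semantic
# space used as a translational layer between motion and text. Both modalities are
# projected onto a small set of anchor vectors; cross-modal translation is a
# nearest-neighbor lookup among those anchors.

import numpy as np

# Hypothetical anchors: each pairs a text descriptor of a social interaction with
# (a) a point in the shared semantic space and (b) parameters for a motion generator.
ANCHORS = [
    {"text": "one character approaches the other warmly",
     "sem": np.array([0.9, 0.1, 0.0]), "motion": {"speed": 0.3, "distance": -1.0}},
    {"text": "one character flees from the other in fear",
     "sem": np.array([0.0, 0.9, 0.1]), "motion": {"speed": 0.9, "distance": 1.0}},
    {"text": "the characters circle each other tensely",
     "sem": np.array([0.1, 0.2, 0.9]), "motion": {"speed": 0.5, "distance": 0.0}},
]

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def nearest_anchor(sem_vec):
    return max(ANCHORS, key=lambda anchor: cosine(sem_vec, anchor["sem"]))

def encode_motion(trajectory):
    """Hypothetical motion encoder: maps a character trajectory (N x 2 positions)
    into the shared semantic space using crude speed and approach/avoid cues."""
    velocity = np.diff(trajectory, axis=0)
    speed = np.linalg.norm(velocity, axis=1).mean()
    closing = -np.sign(trajectory[-1] - trajectory[0]).sum()
    return np.array([max(closing, 0.0), max(-closing, 0.0), speed])

def motion_to_text_steering(trajectory):
    """Motion-steered text generation: translate the manipulated motion into a
    text descriptor that can be inserted into an LLM prompt (e.g., for GPT-4o)."""
    anchor = nearest_anchor(encode_motion(trajectory))
    return f"Continue the story so that {anchor['text']}."

def text_to_motion_steering(text_embedding):
    """Text-steered motion generation: translate a text embedding (from any text
    encoder) into parameters for the motion generation model."""
    return nearest_anchor(text_embedding)["motion"]

if __name__ == "__main__":
    toy_drag = np.array([[0.0, 0.0], [0.2, 0.0], [0.5, 0.1], [0.9, 0.1]])  # user drags a symbol
    print(motion_to_text_steering(toy_drag))
    print(text_to_motion_steering(np.array([0.05, 0.85, 0.2])))
```

The design choice illustrated here is that neither the language model nor the motion model needs to understand the other's modality directly; both read from and write to the same intermediate space, which is what lets motion steer text and text steer motion.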