🤖 AI Summary
This work proposes Doki, a system that integrates generative video creation directly into freeform text writing, establishing a text-centric interaction paradigm. Conventional video tools often obstruct the kind of natural narrative expression that textual composition affords; Doki instead enables end-to-end, text-driven video production within a single document. It supports asset definition, scene orchestration, shot generation, editing refinement, and audio integration by combining natural language parsing, generative video models, and an interactive document interface. Across multiple case analyses and a week-long deployment study, the system showed strong usability and expressive capacity for users of varying skill levels, lowering the barrier to entry while making visual storytelling more efficient and intuitive.
📝 Abstract
Everyone can write their stories in freeform text -- it's something we all learn in school. Yet storytelling via video requires learning specialized and complicated tools. In this paper, we introduce Doki, a text-native interface for generative video authoring that aligns video creation with the natural process of writing. In Doki, writing text is the primary interaction: within a single document, users define assets, structure scenes, create shots, refine edits, and add audio. We articulate the design principles of this text-first approach and demonstrate Doki's capabilities through a series of examples. To evaluate its real-world use, we conducted a week-long deployment study with participants of varying expertise in video authoring. This work contributes a fundamental shift in generative video interfaces, demonstrating a powerful and accessible new way to craft visual stories.