🤖 AI Summary
Current generative AI tools lack task decomposition capabilities, systematic iterative refinement mechanisms, and support for exploratory navigation of the generative space—limiting high-quality, personalized multimodal media creation. To address this, we propose DeckFlow, a specification-driven generative AI tool designed for multimodal (text/image/audio) content synthesis. Methodologically, DeckFlow integrates: (1) an infinite visual dataflow canvas enabling interconnected subtask management; (2) annotation clustering to support hierarchical goal decomposition and progressive specification refinement; and (3) grid-based multi-variant generation coupled with recursive feedback loops for systematic exploration of the generative space. Technically, it unifies multimodal foundation models, a visual dataflow interface, clustering-guided specification annotation, prompt variant sampling, and iterative scaffolding design. Empirical evaluation compares DeckFlow against a state-of-practice conversational-AI baseline on text-to-image generation tasks and examines how it supports cross-modal creative workflows and structured user participation.
📝 Abstract
Generative AI promises to allow people to create high-quality personalized media. Although these tools are powerful, our literature review identifies three fundamental design problems with existing tooling. We introduce DeckFlow, a multimodal generative AI tool, to address these problems. First, DeckFlow supports task decomposition by allowing users to maintain multiple interconnected subtasks on an infinite canvas populated by cards connected through visual dataflow affordances. Second, DeckFlow supports a specification decomposition workflow in which an initial goal is iteratively broken into smaller parts and recombined using feature labels and clusters. Finally, DeckFlow supports generative space exploration by generating multiple prompt and output variations, presented in a grid, which can feed back recursively into the next design iteration. We evaluate DeckFlow for text-to-image generation against a state-of-practice conversational AI baseline on image generation tasks. We then add audio generation and investigate user behaviors in a more open-ended creative setting with text, image, and audio outputs.