🤖 AI Summary
This work addresses the scarcity of large-scale, real-world multimodal task-guidance dialogue data that hinders the development of AR intelligent assistants. To overcome the cost and limited scale of manual data collection, we propose the first fully automated framework that leverages large language models to transform single-person instructional videos into multi-turn, multimodal dialogues between expert and novice personas. Our approach integrates automatic dialogue generation, video-text semantic parsing, and multimodal alignment to construct HowToDIV, a novel dataset comprising 507 dialogues, 6,636 question-answer pairs, and 24 hours of video. We establish baseline performance with Gemma 3 and Qwen 2.5, demonstrating a scalable new paradigm for multimodal dialogue data generation.
📄 Abstract
Many everyday tasks, ranging from appliance repair and cooking to car maintenance, require expert knowledge, particularly for complex, multi-step procedures. Despite growing interest in AI agents for augmented reality (AR) assistance, progress remains limited by the scarcity of large-scale multimodal conversational datasets grounded in real-world task execution, in part due to the cost and logistical complexity of human-assisted data collection. In this paper, we present a framework to automatically transform single-person instructional videos into two-person multimodal task-guidance conversations. Our fully automatic pipeline, based on large language models, provides a scalable and cost-efficient alternative to traditional data collection approaches. Using this framework, we introduce HowToDIV, a multimodal dataset comprising 507 conversations, 6,636 question-answer pairs, and 24 hours of video spanning multiple domains. Each session consists of a multi-turn expert-novice interaction. Finally, we report baseline results using Gemma 3 and Qwen 2.5 on HowToDIV, providing an initial benchmark for multimodal procedural task assistance.
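To make the data-generation idea concrete, the sketch below shows one way an LLM could rewrite a transcribed single-person instructional video into expert-novice dialogue turns aligned with video timestamps. This is a minimal illustrative sketch, not the released HowToDIV pipeline: the `Step` structure, the `call_llm` placeholder, and the prompt wording are all assumptions.

```python
# Minimal sketch (not the authors' released pipeline): turning a transcribed
# single-person instructional video into expert-novice dialogue turns via an LLM.
# `call_llm` is a placeholder for any chat-completion backend (e.g. Gemma 3 or
# Qwen 2.5); all prompt wording and field names here are illustrative assumptions.

from dataclasses import dataclass


@dataclass
class Step:
    start: float   # segment start time in seconds
    end: float     # segment end time in seconds
    narration: str # instructor's transcript for this segment


def call_llm(prompt: str) -> str:
    """Placeholder: route to whichever chat model you use."""
    raise NotImplementedError("plug in your preferred LLM client here")


def steps_to_dialogue(steps: list[Step]) -> list[dict]:
    """Rewrite each narrated step as one novice-question / expert-answer turn,
    keeping the video span so the text stays aligned with the frames."""
    turns = []
    for step in steps:
        prompt = (
            "Rewrite the following instructional narration as a two-person "
            "exchange: a novice asks what to do next, and an expert answers "
            "using only the information in the narration.\n\n"
            f"Narration: {step.narration}\n"
            "Return the question on the first line and the answer on the second."
        )
        question, answer = call_llm(prompt).split("\n", 1)
        turns.append({
            "video_span": (step.start, step.end),
            "novice": question.strip(),
            "expert": answer.strip(),
        })
    return turns
```

In the actual framework, the generated turns would additionally pass through video-text semantic parsing and multimodal alignment so that each question-answer pair is grounded in the corresponding video segment.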