From Videos to Conversations: Egocentric Instructions for Task Assistance

πŸ“… 2026-02-01
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
This work addresses the scarcity of large-scale, real-world multimodal task-guidance dialogue data that hinders the development of AR intelligent assistants. To overcome the cost and small scale of manual data collection, we propose the first fully automated framework that leverages large language models to transform single-person instructional videos into multi-turn, multimodal dialogues between expert and novice personas. Our approach integrates automatic dialogue generation, video-text semantic parsing, and multimodal alignment techniques to construct HowToDIV, a novel dataset comprising 507 dialogues, 6,636 question-answer pairs, and 24 hours of video. We establish baseline performance with Gemma 3 and Qwen 2.5, demonstrating a scalable new paradigm for multimodal dialogue data generation.
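The core transformation the summary describes, turning a single narrator's timestamped instruction steps into an alternating expert-novice dialogue, can be sketched as follows. This is a minimal illustration, not the paper's released pipeline: the `Turn` dataclass, `video_to_dialogue`, and the templated novice questions are hypothetical stand-ins for the LLM-driven generation step.

```python
from dataclasses import dataclass


@dataclass
class Turn:
    role: str    # "novice" or "expert"
    text: str
    start: float  # video timestamp (seconds) grounding this turn
    end: float


def video_to_dialogue(steps):
    """Convert timestamped instructional-video steps (a single narrator)
    into an alternating novice/expert dialogue.

    In the paper's framework an LLM would invent a plausible novice
    question for each step and paraphrase the narration as the expert's
    reply; here fixed templates stand in for those LLM calls.
    """
    turns = []
    for i, (start, end, narration) in enumerate(steps):
        question = "How do I start?" if i == 0 else "What should I do next?"
        turns.append(Turn("novice", question, start, start))
        turns.append(Turn("expert", narration, start, end))
    return turns


# Example: two steps from a hypothetical appliance-repair video.
steps = [
    (0.0, 12.5, "Unplug the appliance and remove the back panel."),
    (12.5, 30.0, "Locate the heating element and test it with a multimeter."),
]
dialogue = video_to_dialogue(steps)
```

Keeping the source timestamps on each turn is what makes the resulting conversation multimodal: every question-answer pair stays aligned to the video segment it was derived from.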

πŸ“ Abstract
Many everyday tasks, ranging from appliance repair and cooking to car maintenance, require expert knowledge, particularly for complex, multi-step procedures. Despite growing interest in AI agents for augmented reality (AR) assistance, progress remains limited by the scarcity of large-scale multimodal conversational datasets grounded in real-world task execution, in part due to the cost and logistical complexity of human-assisted data collection. In this paper, we present a framework to automatically transform single-person instructional videos into two-person multimodal task-guidance conversations. Our fully automatic pipeline, based on large language models, provides a scalable and cost-efficient alternative to traditional data collection approaches. Using this framework, we introduce HowToDIV, a multimodal dataset comprising 507 conversations, 6,636 question-answer pairs, and 24 hours of video spanning multiple domains. Each session consists of a multi-turn expert-novice interaction. Finally, we report baseline results using Gemma 3 and Qwen 2.5 on HowToDIV, providing an initial benchmark for multimodal procedural task assistance.
Problem

Research questions and friction points this paper is trying to address.

multimodal conversational dataset
task assistance
egocentric video
AI agents
procedural tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

multimodal conversation generation
instructional video transformation
egocentric task assistance
automatic dataset construction
large language models
πŸ”Ž Similar Papers
No similar papers found.