Generating Dialogues from Egocentric Instructional Videos for Task Assistance: Dataset, Method and Benchmark

📅 2025-08-14
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing dialogue-video datasets severely lack expert-guided, multi-turn dialogues grounded in realistic procedural tasks (e.g., cooking, mechanical repair, gardening), hindering the development of AI assistants for stepwise task execution. To address this, we propose a fully automated pipeline—requiring no manual annotation—that transforms first-person instructional videos into fine-grained, step-aligned expert–novice dialogues. Our method integrates large language model–based question-answer generation, video segmentation, temporal step alignment, and audiovisual signals from a wearable device to capture user context. Leveraging this framework, we introduce HowToDIV, a new benchmark dataset comprising 507 dialogues, 6,636 QA pairs, and 24 hours of aligned video segments spanning diverse procedural domains. Additionally, we establish a baseline on this benchmark using the Gemma-3 model, enabling standardized assessment of instruction-following and step-grounded reasoning for procedural dialogue assistance.

📝 Abstract
Many everyday tasks, ranging from fixing appliances and cooking recipes to car maintenance, require expert knowledge, especially when tasks are complex and multi-step. Despite growing interest in AI agents, there is a scarcity of dialogue-video datasets grounded in real-world task assistance. In this paper, we propose a simple yet effective approach that transforms single-person instructional videos into task-guidance two-person dialogues, aligned with fine-grained steps and video clips. Our fully automatic approach, powered by large language models, offers an efficient alternative to the substantial cost and effort required for human-assisted data collection. Using this technique, we build HowToDIV, a large-scale dataset containing 507 conversations, 6,636 question-answer pairs, and 24 hours of video clips across diverse tasks in cooking, mechanics, and planting. Each session includes a multi-turn conversation in which an expert teaches a novice user how to perform a task step by step, while observing the user's surroundings through a camera- and microphone-equipped wearable device. We establish baseline benchmark performance on the HowToDIV dataset using the Gemma-3 model, to support future research on this new task of dialogues for procedural-task assistance.
Problem

Research questions and friction points this paper is trying to address.

Generating dialogues from instructional videos for task assistance
Transforming single-person videos into two-person guidance dialogues
Creating dataset for AI task assistance with video-dialogue alignment
Innovation

Methods, ideas, or system contributions that make the work stand out.

Convert single-person instructional videos into two-person dialogues
Automate dialogue generation with large language models, without manual annotation
Align dialogue turns with fine-grained video steps
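The conversion idea above can be sketched in a few lines: each step-aligned transcript segment becomes one novice question and one expert answer, with the clip's time span kept alongside the turn. This is an illustrative assumption of how such a pipeline might look, not the authors' actual implementation; the `Step` fields, the prompt wording, and the `ask_llm` stub (a placeholder for a real LLM call) are all hypothetical.

```python
# Hypothetical sketch: turn step-aligned narration from an instructional
# video into expert-novice dialogue turns. Not the paper's actual code.
from dataclasses import dataclass


@dataclass
class Step:
    start_s: float   # clip start time in the video (seconds)
    end_s: float     # clip end time
    narration: str   # transcript text for this step


def ask_llm(prompt: str) -> str:
    """Placeholder for a large language model call.

    A real pipeline would query an LLM here; this stub simply echoes
    the narration portion of the prompt back as the expert's turn.
    """
    return prompt.split("Narration: ", 1)[1]


def steps_to_dialogue(steps: list[Step]) -> list[dict]:
    """Map each video step to one novice-question / expert-answer pair."""
    dialogue = []
    for i, step in enumerate(steps, start=1):
        question = f"What should I do for step {i}?"
        answer = ask_llm(
            "Rephrase this narration as expert guidance. "
            f"Narration: {step.narration}"
        )
        dialogue.append({
            "turn": i,
            "novice": question,
            "expert": answer,
            "clip": (step.start_s, step.end_s),  # keeps dialogue video-aligned
        })
    return dialogue


steps = [
    Step(0.0, 12.5, "Chop the onions finely."),
    Step(12.5, 30.0, "Saute them until golden."),
]
for turn in steps_to_dialogue(steps):
    print(turn["novice"], "->", turn["expert"])
```

Keeping the `(start_s, end_s)` span on every turn is what preserves the dialogue–video alignment the dataset provides; a real pipeline would also need the segmentation and temporal alignment stages described above to produce the `Step` list in the first place.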