🤖 AI Summary
This work addresses the scarcity of large-scale, real-world multimodal task-guidance dialogue data that hinders the development of AR intelligent assistants. To overcome the cost and limited scale of manual data collection, we propose the first fully automated framework that leverages large language models to transform single-person instructional videos into multi-turn, multimodal dialogues between expert and novice personas. Our approach integrates automatic dialogue generation, video-text semantic parsing, and multimodal alignment to construct HowToDIV, a novel dataset comprising 507 dialogues, 6,636 question-answer pairs, and 24 hours of video. We establish baseline performance with Gemma 3 and Qwen 2.5, demonstrating a scalable new paradigm for multimodal dialogue data generation.
📄 Abstract
Many everyday tasks, ranging from appliance repair and cooking to car maintenance, require expert knowledge, particularly for complex, multi-step procedures. Despite growing interest in AI agents for augmented reality (AR) assistance, progress remains limited by the scarcity of large-scale multimodal conversational datasets grounded in real-world task execution, in part due to the cost and logistical complexity of human-assisted data collection. In this paper, we present a framework to automatically transform single-person instructional videos into two-person multimodal task-guidance conversations. Our fully automatic pipeline, based on large language models, provides a scalable and cost-efficient alternative to traditional data collection approaches. Using this framework, we introduce HowToDIV, a multimodal dataset comprising 507 conversations, 6,636 question-answer pairs, and 24 hours of video spanning multiple domains. Each session consists of a multi-turn expert-novice interaction. Finally, we report baseline results using Gemma 3 and Qwen 2.5 on HowToDIV, providing an initial benchmark for multimodal procedural task assistance.
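To make the data-generation idea concrete, the sketch below shows one way an LLM could rewrite a transcribed single-person instructional video into expert-novice dialogue turns aligned with video timestamps. This is a minimal illustrative sketch, not the released HowToDIV pipeline: the `Step` structure, the `call_llm` placeholder, and the prompt wording are all assumptions.

```python
# Minimal sketch (not the authors' released pipeline): turning a transcribed
# single-person instructional video into expert-novice dialogue turns via an LLM.
# `call_llm` is a placeholder for any chat-completion backend (e.g. Gemma 3 or
# Qwen 2.5); all prompt wording and field names here are illustrative assumptions.

from dataclasses import dataclass


@dataclass
class Step:
    start: float   # segment start time in seconds
    end: float     # segment end time in seconds
    narration: str # instructor's transcript for this segment


def call_llm(prompt: str) -> str:
    """Placeholder: route to whichever chat model you use."""
    raise NotImplementedError("plug in your preferred LLM client here")


def steps_to_dialogue(steps: list[Step]) -> list[dict]:
    """Rewrite each narrated step as one novice-question / expert-answer turn,
    keeping the video span so the text stays aligned with the frames."""
    turns = []
    for step in steps:
        prompt = (
            "Rewrite the following instructional narration as a two-person "
            "exchange: a novice asks what to do next, and an expert answers "
            "using only the information in the narration.\n\n"
            f"Narration: {step.narration}\n"
            "Return the question on the first line and the answer on the second."
        )
        question, answer = call_llm(prompt).split("\n", 1)
        turns.append({
            "video_span": (step.start, step.end),
            "novice": question.strip(),
            "expert": answer.strip(),
        })
    return turns
```

In the actual framework, the generated turns would additionally pass through video-text semantic parsing and multimodal alignment so that each question-answer pair is grounded in the corresponding video segment.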