🤖 AI Summary
To address the challenge of assisting users through multi-step procedural tasks, this paper introduces InsTALL, a context-aware instructional task assistant built on multimodal large language models. Methodologically, it combines training a multimodal model on task videos paired with textual data and automatically constructing a task graph from video data, which is leveraged at both training and inference time; the resulting system processes an online visual stream (e.g., a user's screen share or video recording) and responds to task-related queries in real time. InsTALL achieves state-of-the-art performance on four sub-tasks for multimodal activity understanding (task recognition, action recognition, next action prediction, and plan prediction) and outperforms existing baselines on two novel sub-tasks for automatic error identification, demonstrating its ability to maintain situational awareness of the task at hand and to diagnose fine-grained errors.
📝 Abstract
The improved competence of generative models can help build multimodal virtual assistants that leverage modalities beyond language. By observing humans performing multi-step tasks, one can build assistants that have situational awareness of the actions and tasks being performed, enabling them to tailor assistance based on this understanding. In this paper, we develop a Context-aware Instructional Task Assistant with Multi-modal Large Language Models (InsTALL) that leverages an online visual stream (e.g., a user's screen share or video recording) and responds in real time to user queries related to the task at hand. To enable useful assistance, InsTALL 1) trains a multimodal model on task videos and paired textual data, and 2) automatically extracts a task graph from video data and leverages it at training and inference time. We show InsTALL achieves state-of-the-art performance across the proposed sub-tasks considered for multimodal activity understanding -- task recognition (TR), action recognition (AR), next action prediction (AP), and plan prediction (PP) -- and outperforms existing baselines on two novel sub-tasks related to automatic error identification.
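To make the task-graph idea concrete, below is a minimal Python sketch of how a graph of action transitions could be built from recognized action sequences in task videos and queried for likely next actions at inference time. This is an illustrative assumption of one possible realization; the names (`TaskGraph`, `add_sequence`, `next_actions`) and the counting scheme are hypothetical and are not the paper's actual implementation.

```python
from collections import defaultdict

class TaskGraph:
    """Minimal sketch of a task graph: nodes are actions, edges count
    observed "action A is followed by action B" transitions.
    Hypothetical structure for illustration, not InsTALL's implementation."""

    def __init__(self):
        # edges[prev][next] = number of times `next` followed `prev`
        self.edges = defaultdict(lambda: defaultdict(int))

    def add_sequence(self, actions):
        """Update edge counts from one video's recognized action sequence."""
        for prev, nxt in zip(actions, actions[1:]):
            self.edges[prev][nxt] += 1

    def next_actions(self, action, top_k=3):
        """Rank likely next actions, e.g. as context for next action prediction."""
        successors = self.edges.get(action, {})
        ranked = sorted(successors.items(), key=lambda kv: -kv[1])
        return [a for a, _ in ranked[:top_k]]


# Example with made-up action sequences from two cooking-task videos.
graph = TaskGraph()
graph.add_sequence(["crack egg", "whisk egg", "heat pan", "pour egg"])
graph.add_sequence(["crack egg", "whisk egg", "add salt", "heat pan"])
print(graph.next_actions("whisk egg"))  # ['heat pan', 'add salt']
```

Such a graph could be serialized and injected into the model's context so that next action prediction and error identification can be conditioned on the expected task structure rather than on the visual stream alone.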