🤖 AI Summary
RGB-video task demonstrations struggle to capture fine-grained contextual cues such as user intent, safety constraints, and individual preferences, which limits how well vision-language models (VLMs) understand a task and adapt to individual users. This work targets real-world task assistance and, for the first time, systematically shows that implicit eye-tracking cues and explicit speech cues complement each other in a task-type-dependent way. We propose MICA (Multimodal Interactive Contextualized Assistance), a framework that, from a single demonstration, segments the task into subtasks, extracts keyframe-caption pairs, and supplies them as context for VLM question answering. MICA jointly processes eye trajectories and speech transcriptions via temporal segmentation, intention-driven caption generation, and context-reweighted reasoning. Experiments show that multimodal cues significantly outperform frame-only retrieval baselines: eye tracking alone reaches 93% of speech’s performance, and fusing the two yields the highest accuracy, which points to the limits of frame-level representations.
📝 Abstract
A person's demonstration often serves as a key reference for others learning the same task. However, RGB video, the dominant medium for representing these demonstrations, often fails to capture fine-grained contextual cues such as intent, safety-critical environmental factors, and subtle preferences embedded in human behavior. This sensory gap fundamentally limits the ability of Vision Language Models (VLMs) to reason about why actions occur and how they should adapt to individual users. To address this, we introduce MICA (Multimodal Interactive Contextualized Assistance), a framework that improves conversational agents for task assistance by integrating eye gaze and speech cues. MICA segments demonstrations into meaningful sub-tasks and extracts keyframes and captions that capture fine-grained intent and user-specific cues, enabling richer contextual grounding for visual question answering. Evaluations on questions derived from real-time chat-assisted task replication show that multimodal cues significantly improve response quality over frame-based retrieval. Notably, gaze cues alone achieve 93% of speech performance, and their combination yields the highest accuracy. Task type determines the effectiveness of implicit (gaze) vs. explicit (speech) cues, underscoring the need for adaptable multimodal models. These results highlight the limitations of frame-based context and demonstrate the value of multimodal signals for real-world AI task assistance.
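The abstract describes the pipeline only at a high level, so the following is a minimal illustrative sketch and not the authors' implementation. It assumes gaze samples already carry a label for the fixated object, splits the demonstration wherever that label changes, attaches each utterance to the segment it falls in, and uses the segment midpoint as a stand-in keyframe. All names (`GazeSample`, `Utterance`, `Segment`, `segment_by_gaze`, `attach_speech`, `build_context`) and the `min_dwell` threshold are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class GazeSample:
    t: float      # seconds from the start of the demonstration
    target: str   # hypothetical label of the fixated object


@dataclass
class Utterance:
    t: float      # time the utterance starts, in seconds
    text: str


@dataclass
class Segment:
    start: float
    end: float
    target: str
    speech: list = field(default_factory=list)

    @property
    def keyframe_time(self) -> float:
        # Assumption: the temporal midpoint stands in for a keyframe.
        return (self.start + self.end) / 2


def segment_by_gaze(samples, min_dwell=1.0):
    """Group consecutive samples on the same target into segments,
    dropping runs that last less than min_dwell seconds."""
    segments = []
    run_start = 0
    for i in range(1, len(samples) + 1):
        if i == len(samples) or samples[i].target != samples[run_start].target:
            first, last = samples[run_start], samples[i - 1]
            if last.t - first.t >= min_dwell:
                segments.append(Segment(first.t, last.t, first.target))
            run_start = i
    return segments


def attach_speech(segments, utterances):
    """Attach each utterance to the first segment whose time span contains it."""
    for u in utterances:
        for seg in segments:
            if seg.start <= u.t <= seg.end:
                seg.speech.append(u.text)
                break
    return segments


def build_context(segments):
    """Render the segments as plain-text context for a VLM prompt."""
    lines = []
    for n, seg in enumerate(segments, 1):
        said = " ".join(seg.speech) or "(no speech)"
        lines.append(
            f"Step {n} [{seg.keyframe_time:.1f}s] gaze on {seg.target}: {said}"
        )
    return "\n".join(lines)


# Toy demonstration data, invented for illustration only.
gaze = [
    GazeSample(0.0, "knife"), GazeSample(0.5, "knife"), GazeSample(1.5, "knife"),
    GazeSample(2.0, "stove"),
    GazeSample(2.2, "pan"), GazeSample(3.0, "pan"), GazeSample(4.0, "pan"),
]
speech = [
    Utterance(1.0, "Keep fingers away from the blade."),
    Utterance(3.5, "I like it lightly browned."),
]
segments = attach_speech(segment_by_gaze(gaze), speech)
print(build_context(segments))
```

In this toy run the brief glance at the stove is discarded as too short to form a sub-task, while the spoken safety remark and the stated preference end up attached to the knife and pan segments. That is the kind of intent and user-specific context that frames alone would not carry.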