Grounding Task Assistance with Multimodal Cues from a Single Demonstration

📅 2025-05-02

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Current RGB-video task demonstrations struggle to capture fine-grained contextual cues such as user intent, safety constraints, and individual preferences, which limits the task understanding and adaptability of vision-language models (VLMs). This work addresses real-world task assistance scenarios and, for the first time, systematically reveals that the complementarity between implicit eye-tracking cues and explicit speech cues depends on task type. We propose MICA, a multimodal intention-aware framework that enables subtask segmentation, extraction of keyframe and caption pairs, and context-enhanced VLM question answering from a single demonstration. MICA jointly processes eye trajectories and speech transcriptions via temporal segmentation, intention-driven caption generation, and context-reweighted reasoning. Experiments show that multimodal cues significantly outperform frame-only retrieval baselines: eye tracking alone achieves 93% of speech’s performance, and fusing the two yields the highest accuracy, highlighting the limitations of frame-level representations.

📝 Abstract

A person's demonstration often serves as a key reference for others learning the same task. However, RGB video, the dominant medium for representing these demonstrations, often fails to capture fine-grained contextual cues such as intent, safety-critical environmental factors, and subtle preferences embedded in human behavior. This sensory gap fundamentally limits the ability of Vision Language Models (VLMs) to reason about why actions occur and how they should adapt to individual users. To address this, we introduce MICA (Multimodal Interactive Contextualized Assistance), a framework that improves conversational agents for task assistance by integrating eye gaze and speech cues. MICA segments demonstrations into meaningful sub-tasks and extracts keyframes and captions that capture fine-grained intent and user-specific cues, enabling richer contextual grounding for visual question answering. Evaluations on questions derived from real-time chat-assisted task replication show that multimodal cues significantly improve response quality over frame-based retrieval. Notably, gaze cues alone achieve 93% of the performance obtained with speech cues, and combining the two yields the highest accuracy. Task type determines the relative effectiveness of implicit (gaze) and explicit (speech) cues, underscoring the need for adaptable multimodal models. These results highlight the limitations of frame-based context and demonstrate the value of multimodal signals for real-world AI task assistance.

Problem

Research questions and friction points this paper is trying to address.

RGB video lacks fine-grained contextual cues in task demonstrations

Vision Language Models struggle to reason about actions and user adaptation

Current methods miss intent and user-specific cues in task assistance

Innovation

Methods, ideas, or system contributions that make the work stand out.

Integrates eye gaze and speech cues

Segments demonstrations into sub-tasks

Extracts keyframes and contextual captions
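
To make the three points above concrete, here is a minimal sketch of one way gaze and speech could be turned into per-sub-task keyframes and captions. It is not MICA's actual method, which this summary does not specify. The pause-based segmentation rule, the longest-fixation keyframe heuristic, and every name in the code (`Utterance`, `Fixation`, `segment_by_pauses`, `keyframe_time`, `build_context`, `max_gap`) are assumptions made for illustration only.

```python
from dataclasses import dataclass


@dataclass
class Utterance:
    start: float  # seconds from the start of the demonstration
    end: float
    text: str


@dataclass
class Fixation:
    start: float  # seconds from the start of the demonstration
    end: float


def segment_by_pauses(utterances, max_gap=2.0):
    """Group time-ordered utterances into sub-tasks.

    Assumed heuristic: a silence longer than max_gap seconds between
    consecutive utterances marks a sub-task boundary.
    """
    segments = []
    for u in utterances:
        if segments and u.start - segments[-1][-1].end <= max_gap:
            segments[-1].append(u)
        else:
            segments.append([u])
    return segments


def keyframe_time(segment, fixations):
    """Pick a keyframe timestamp for one sub-task.

    Assumed heuristic: use the midpoint of the longest gaze fixation that
    overlaps the sub-task; fall back to the sub-task midpoint if none does.
    """
    seg_start, seg_end = segment[0].start, segment[-1].end
    overlapping = [f for f in fixations if f.start < seg_end and f.end > seg_start]
    if not overlapping:
        return (seg_start + seg_end) / 2
    longest = max(overlapping, key=lambda f: f.end - f.start)
    return (longest.start + longest.end) / 2


def build_context(utterances, fixations, max_gap=2.0):
    """Return one keyframe timestamp and one caption per sub-task."""
    return [
        {
            "keyframe_time": keyframe_time(seg, fixations),
            "caption": " ".join(u.text for u in seg),
        }
        for seg in segment_by_pauses(utterances, max_gap)
    ]
```

Under these assumptions, the resulting list of keyframe timestamps and captions could be passed to a VLM as context for question answering; a real system would likely replace the raw transcript with a generated caption.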
