🤖 AI Summary
Existing approaches to multi-step task understanding in human-robot dialogue are constrained by fragment-level modeling, which limits their ability to exploit long-term sequential context and results in inaccurate action confirmation and degraded planning performance. To address this, we propose a unified framework integrating a long-context-aware Q-Former with a multimodal large language model (MLLM). The Q-Former explicitly captures temporal dependencies across the full video, while a text-conditioned input mechanism mitigates semantic over-abstraction and improves cross-modal alignment. The multimodal Transformer architecture feeds text embeddings directly into the LLM decoder and builds on VideoLLaMA3. Experiments on YouCook2 demonstrate a significant improvement in action confirmation accuracy; moreover, explicit long-context modeling substantially enhances end-to-end action planning performance.
📝 Abstract
Human-robot collaboration towards a shared goal requires robots to understand human actions and interactions with the surrounding environment. This paper focuses on human-robot interaction (HRI) through dialogue, which relies on robot action confirmation and action-step generation based on multimodal scene understanding. The state-of-the-art approach uses multimodal transformers to generate robot action steps, aligned with action confirmation, from a single clip of a task composed of multiple micro-steps. Although the actions in a long-horizon task depend on one another throughout an entire video, current approaches mainly operate at the clip level and do not leverage long-context information. This paper proposes a long-context Q-Former that incorporates left and right context dependencies across full videos. It further proposes a text-conditioning approach that feeds text embeddings directly into the LLM decoder, mitigating the high abstraction of information introduced by the Q-Former. Experiments on the YouCook2 corpus show that the accuracy of confirmation generation is a major factor in action-planning performance, and demonstrate that the long-context Q-Former, integrated with VideoLLaMA3, improves both confirmation and action planning.
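The two architectural ideas above can be sketched in a few lines of PyTorch. This is an illustrative sketch only, not the authors' implementation: the class name, dimensions, and the simple concatenation-based text conditioning are assumptions chosen to show the mechanism — learnable queries cross-attending to frame features from the *entire* video (so each summary token sees both left and right context), with the resulting visual tokens concatenated with text embeddings and fed directly to the LLM decoder.

```python
import torch
import torch.nn as nn

class LongContextQFormer(nn.Module):
    """Sketch of a long-context Q-Former (hypothetical, not the paper's code):
    a small set of learnable query tokens cross-attends to frame features
    from the full video rather than a single clip, giving each query
    bidirectional (left and right) temporal context."""
    def __init__(self, num_queries=32, dim=768, num_heads=8, num_layers=2):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        layer = nn.TransformerDecoderLayer(d_model=dim, nhead=num_heads,
                                           batch_first=True)
        self.blocks = nn.TransformerDecoder(layer, num_layers=num_layers)

    def forward(self, frame_feats):
        # frame_feats: (batch, total_frames, dim) — features covering the
        # whole video, not one clip; attention is unmasked, so every query
        # can attend to past and future frames alike.
        q = self.queries.unsqueeze(0).expand(frame_feats.size(0), -1, -1)
        return self.blocks(tgt=q, memory=frame_feats)  # (batch, num_queries, dim)

# Text conditioning (assumed concatenation scheme): instead of relying only
# on the Q-Former's highly abstracted visual tokens, the text embeddings are
# passed directly to the LLM decoder as part of its input sequence.
qformer = LongContextQFormer()
frame_feats = torch.randn(2, 512, 768)   # full-video frame features (dummy)
text_embeds = torch.randn(2, 24, 768)    # dialogue/instruction embeddings (dummy)
visual_tokens = qformer(frame_feats)     # (2, 32, 768)
llm_input = torch.cat([visual_tokens, text_embeds], dim=1)
print(llm_input.shape)                   # (2, 56, 768)
```

In an actual system the `llm_input` sequence would be consumed by the MLLM decoder (here, VideoLLaMA3's language model) to generate the confirmation and action steps; the dummy tensors above stand in for real frame and token embeddings.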