Robot Confirmation Generation and Action Planning Using Long-context Q-Former Integrated with Multimodal LLM

📅 2025-11-21
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing approaches to multi-step task understanding in human-robot dialogue are constrained by clip-level modeling, which limits their ability to exploit long-term sequential context and results in inaccurate action confirmation and degraded planning performance. To address this, the authors propose a unified framework integrating a long-context-aware Q-Former with a multimodal large language model (MLLM). The Q-Former explicitly captures temporal dependencies across the full video, while a text-conditioned input mechanism mitigates semantic over-abstraction and improves cross-modal alignment. The architecture feeds text embeddings directly into the LLM decoder and builds on VideoLLaMA3. Experiments on YouCook2 demonstrate a significant improvement in action-confirmation accuracy; moreover, explicit long-context modeling substantially enhances end-to-end action-planning performance.

📝 Abstract
Human-robot collaboration towards a shared goal requires robots to understand human actions and interaction with the surrounding environment. This paper focuses on human-robot interaction (HRI) based on human-robot dialogue that relies on robot action confirmation and action-step generation using multimodal scene understanding. The state-of-the-art approach uses multimodal transformers to generate robot action steps aligned with robot action confirmation from a single clip showing a task composed of multiple micro steps. Although the actions in a long-horizon task depend on each other throughout an entire video, current approaches mainly focus on clip-level processing and do not leverage long-context information. This paper proposes a long-context Q-Former incorporating left and right context dependencies across full videos. Furthermore, it proposes a text-conditioning approach that feeds text embeddings directly into the LLM decoder, mitigating the high abstraction of textual information produced by the Q-Former. Experiments with the YouCook2 corpus show that the accuracy of confirmation generation is a major factor in action-planning performance. Furthermore, the long-context Q-Former improves confirmation and action planning when integrated with VideoLLaMA3.
Problem

Research questions and friction points this paper is trying to address.

Generating robot action confirmations through multimodal scene understanding
Planning long-horizon tasks using full video context dependencies
Mitigating information abstraction in text embeddings for action planning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Long-context Q-Former integrates full video dependencies
Text embeddings directly fed into LLM decoder
Multimodal integration improves confirmation and action planning
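The two innovations above can be illustrated together: a bank of learnable queries cross-attends over frame features from the full video (not a single clip), and raw text embeddings are additionally passed straight to the LLM decoder rather than only through the Q-Former bottleneck. The sketch below is a minimal, dependency-free illustration of that data flow; all shapes, names, and the single-head attention are hypothetical stand-ins, not the paper's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64            # hidden size (hypothetical)
num_queries = 8   # learnable Q-Former queries (hypothetical count)
T = 120           # frame features spanning the FULL video, not one clip

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attend(queries, keys_values):
    """One single-head cross-attention read: queries attend over keys_values."""
    scores = queries @ keys_values.T / np.sqrt(d)   # (num_queries, T+num_text)
    return softmax(scores) @ keys_values            # (num_queries, d)

# Learnable queries and full-video frame features (random stand-ins).
queries = rng.standard_normal((num_queries, d))
video_feats = rng.standard_normal((T, d))

# Text conditioning: prepend dialogue/instruction embeddings to the visual
# sequence so the queries can align visual content with the text context.
text_embeds = rng.standard_normal((5, d))
conditioned = np.concatenate([text_embeds, video_feats], axis=0)

# Compress the conditioned full-video sequence into a few visual tokens.
video_tokens = cross_attend(queries, conditioned)   # (8, 64)

# Decoder input: the raw text embeddings bypass the Q-Former and are fed
# directly to the LLM alongside the compressed visual tokens.
llm_input = np.concatenate([text_embeds, video_tokens], axis=0)
print(llm_input.shape)  # (13, 64)
```

The point of the sketch is the shape arithmetic: however long the video (T frames), the Q-Former emits a fixed number of visual tokens, so full-video context fits the LLM's input budget, while the direct text path preserves detail that the visual bottleneck would otherwise abstract away.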