🤖 AI Summary
To address memory limitations, contextual fragmentation, and insufficient personalization of large language models (LLMs) in robot-mediated instruction, this paper proposes a cognitively inspired multimodal memory architecture for human–robot collaborative teaching agents. Methodologically, it integrates multimodal perception interfaces, a hierarchical memory system featuring selective encoding and context-aware retrieval, and a dual-track decision mechanism balancing task execution and social interaction—enabling dynamic goal-directed behavior and naturalistic social engagement. The key contribution is the first integration of embodied memory modeling with LLM-based agents, supporting cross-session knowledge accumulation and generalizable reasoning. Empirical validation via human–robot interaction (HRI) user studies and synthetic data experiments demonstrates that the system autonomously advances training workflows, sustains long-horizon dialogue coherence, and significantly improves task completion rate (+28.6%) and interaction naturalness (p < 0.01).
📝 Abstract
Integrating robotics into everyday scenarios like tutoring or physical training requires robots capable of adaptive, socially engaging, and goal-oriented interactions. While Large Language Models show promise in human-like communication, their standalone use is hindered by memory constraints and contextual incoherence. This work presents a multimodal, cognitively inspired framework that enhances LLM-based autonomous decision-making in social and task-oriented Human-Robot Interaction. Specifically, we develop an LLM-based agent for a robot trainer, balancing social conversation with task guidance and goal-driven motivation. To further enhance autonomy and personalization, we introduce a memory system for selecting, storing and retrieving experiences, facilitating generalized reasoning based on knowledge built across different interactions. A preliminary HRI user study and offline experiments with a synthetic dataset validate our approach, demonstrating the system's ability to manage complex interactions, autonomously drive training tasks, and build and retrieve contextual memories, advancing socially intelligent robotics.