🤖 AI Summary
This work addresses the challenge of verbalizing a robot's long-term experience, spanning months of multimodal perceptual and proprioceptive data, a scale at which existing methods fail to support natural, interpretable human-robot interaction. The authors propose a lifelong experience verbalization framework that integrates a hierarchical memory tree (perception → event → linguistic concept) with large language models (LLMs). The tree-structured semantic representation organizes cross-episode experiences in a temporally coherent way, while LLM-driven dynamic retrieval, combined with zero- and few-shot prompting, supports efficient summarization and interactive question answering over long-term memory. In contrast to conventional short-horizon, fragment-based memory architectures, the approach couples hierarchical memory modeling directly with LLM-based retrieval. Evaluated on simulated household robot data, egocentric video datasets, and real-robot recordings, the framework verbalizes multi-month experiences with low computational overhead and high fidelity.
📝 Abstract
Verbalization of robot experience, i.e., summarization of and question answering about a robot's past, is a crucial ability for improving human-robot interaction. Previous works applied rule-based systems or fine-tuned deep models to verbalize short (several-minute-long) streams of episodic data, limiting generalization and transferability. In our work, we apply large pretrained models to tackle this task with zero or few examples, and specifically focus on verbalizing lifelong experiences. For this, we derive a tree-like data structure from episodic memory (EM), with lower levels representing raw perception and proprioception data, and higher levels abstracting events to natural language concepts. Given such a hierarchical representation built from the experience stream, we apply a large language model as an agent to interactively search the EM given a user's query, dynamically expanding (initially collapsed) tree nodes to find the relevant information. The approach keeps computational costs low even when scaling to months of robot experience data. We evaluate our method on simulated household robot data, human egocentric videos, and real-world robot recordings, demonstrating its flexibility and scalability.
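The retrieval mechanism described above (an initially collapsed memory tree that an LLM agent expands node by node) can be sketched as follows. This is a minimal illustration, not the paper's implementation: all names (`MemoryNode`, `relevance`, `search`) are hypothetical, and the LLM's relevance judgment is stubbed with simple keyword overlap so the example is self-contained.

```python
# Hedged sketch of hierarchical episodic-memory search. Names are illustrative,
# and the LLM relevance call is replaced by a keyword-overlap stub.
from dataclasses import dataclass, field
from typing import List

@dataclass
class MemoryNode:
    summary: str                       # natural-language abstraction of this subtree
    children: List["MemoryNode"] = field(default_factory=list)
    expanded: bool = False             # the tree starts fully collapsed

def relevance(query: str, summary: str) -> float:
    """Stand-in for an LLM relevance judgment: crude keyword overlap."""
    q, s = set(query.lower().split()), set(summary.lower().split())
    return len(q & s) / max(len(q), 1)

def search(root: MemoryNode, query: str, threshold: float = 0.15) -> List[str]:
    """Expand only subtrees deemed relevant; collect leaf-level events."""
    hits: List[str] = []
    frontier = [root]
    while frontier:
        node = frontier.pop()
        if relevance(query, node.summary) < threshold:
            continue                   # irrelevant subtree stays collapsed
        node.expanded = True
        if node.children:
            frontier.extend(node.children)
        else:
            hits.append(node.summary)  # leaf event enters the answer context
    return hits

# Toy experience stream: two days of household activity, one query.
day1 = MemoryNode("monday kitchen cleaning", [
    MemoryNode("wiped the kitchen counter"),
    MemoryNode("loaded the dishwasher in the kitchen"),
])
day2 = MemoryNode("tuesday garden watering", [MemoryNode("watered the garden plants")])
root = MemoryNode("week of kitchen and garden chores", [day1, day2])

evidence = search(root, "what happened in the kitchen")
```

Because irrelevant subtrees (here, the garden day) are never expanded, the cost of answering a query grows with the relevant portion of the tree rather than with the total amount of recorded experience, which is the scalability property the abstract highlights.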