🤖 AI Summary
Existing LLM evaluation benchmarks inadequately capture lifelong learning capabilities, such as sustained role behavior, self-awareness, episodic memory, and relationship tracking, in multi-turn, multi-agent interactions. To address this gap, we introduce LIFESTATE-BENCH, the first systematic benchmark for evaluating emergent lifelong learning in LLMs. Built on two episodic datasets, an adapted script of *Hamlet* and a synthetic script collection, both rich in narrative structure and character interactions, it overcomes the limitations of static, single-turn assessment. We propose a fact-checking evaluation protocol that combines introspective question answering with cross-turn state verification, and we comparatively analyze parametric (fine-tuning/prompting) versus non-parametric (retrieval-augmented generation/external memory) approaches. Experiments reveal that non-parametric methods significantly outperform parametric ones; moreover, leading models, including Llama3.1-8B, GPT-4-turbo, and DeepSeek R1, exhibit catastrophic forgetting in long-horizon interactions, exposing a fundamental bottleneck in current lifelong learning capabilities.
📝 Abstract
Large language models (LLMs) can carry out human-like dialogue, but unlike humans they are stateless due to the superposition property. Nevertheless, during multi-turn, multi-agent interactions, LLMs begin to exhibit consistent, character-like behaviors, hinting at a form of emergent lifelong learning. Existing benchmarks fail to capture these dynamics, focusing primarily on static, open-ended evaluations. To address this gap, we introduce LIFESTATE-BENCH, a benchmark designed to assess lifelong learning in LLMs. It features two episodic datasets, Hamlet and a synthetic script collection, both rich in narrative structure and character interactions. Our fact-checking evaluation probes models' self-awareness, episodic memory retrieval, and relationship tracking across both parametric and non-parametric approaches. Through experiments on models such as Llama3.1-8B, GPT-4-turbo, and DeepSeek R1, we demonstrate that non-parametric methods significantly outperform parametric ones in managing stateful learning. However, all models struggle with catastrophic forgetting as interactions lengthen, highlighting the need for further advances in lifelong learning.
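To make the cross-turn state-verification idea concrete, here is a minimal illustrative sketch, not the benchmark's actual implementation: ground-truth character state is recorded turn by turn, and a model's introspective answer is scored against the most recently established fact. All names (`StateRecord`, `EpisodeVerifier`, the example facts) are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class StateRecord:
    """Ground-truth facts established at one turn of an episode."""
    turn: int
    facts: dict  # e.g. {"relationship:Ophelia": "estranged"}

class EpisodeVerifier:
    """Tracks per-turn state and checks model answers against it."""

    def __init__(self):
        self.history: list[StateRecord] = []

    def record(self, turn: int, facts: dict) -> None:
        self.history.append(StateRecord(turn, facts))

    def latest_fact(self, key: str):
        # The most recent turn that set this fact is authoritative.
        for rec in reversed(self.history):
            if key in rec.facts:
                return rec.facts[key]
        return None

    def verify(self, key: str, model_answer: str) -> bool:
        """True if the model's answer reflects the current tracked state."""
        truth = self.latest_fact(key)
        return truth is not None and truth.lower() in model_answer.lower()

# Usage: a fact set early in the episode is overwritten later; a model that
# still reports the stale value is flagged as having forgotten the update.
v = EpisodeVerifier()
v.record(3, {"relationship:Ophelia": "devoted"})
v.record(7, {"relationship:Ophelia": "estranged"})
print(v.verify("relationship:Ophelia", "He is now estranged from her"))   # True
print(v.verify("relationship:Ophelia", "He remains devoted to Ophelia"))  # False
```

Under this framing, catastrophic forgetting shows up as verification failures that accumulate with episode length, which is what the benchmark's long-horizon results measure.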