🤖 AI Summary
Existing LLM evaluation benchmarks inadequately capture lifelong learning capabilities, such as sustained role behavior, self-awareness, episodic memory, and relationship tracking, in multi-turn, multi-agent interactions. To address this gap, we introduce LIFESTATE-BENCH, the first systematic benchmark for evaluating emergent lifelong learning in LLMs. Built on two episodic datasets, an adapted script of *Hamlet* and a synthetic script collection, both rich in narrative structure and character interactions, it overcomes the limitations of static, single-turn assessment. We propose a fact-checking evaluation protocol that combines introspective question answering with cross-turn state verification, and we comparatively analyze parametric (fine-tuning/prompting) versus non-parametric (retrieval-augmented generation/external memory) approaches. Experiments reveal that non-parametric methods significantly outperform parametric ones; moreover, leading models, including Llama3.1-8B, GPT-4-turbo, and DeepSeek R1, exhibit catastrophic forgetting in long-horizon interactions, exposing a fundamental bottleneck in current lifelong learning capabilities.
📝 Abstract
Large language models (LLMs) can carry out human-like dialogue, but unlike humans they are stateless due to the superposition property. Nevertheless, during multi-turn, multi-agent interactions, LLMs begin to exhibit consistent, character-like behaviors, hinting at a form of emergent lifelong learning. Existing benchmarks fail to capture these dynamics, focusing primarily on static, open-ended evaluations. To address this gap, we introduce LIFESTATE-BENCH, a benchmark designed to assess lifelong learning in LLMs. It features two episodic datasets, Hamlet and a synthetic script collection, both rich in narrative structure and character interactions. Our fact-checking evaluation probes models' self-awareness, episodic memory retrieval, and relationship tracking across both parametric and non-parametric approaches. Through experiments on models such as Llama3.1-8B, GPT-4-turbo, and DeepSeek R1, we demonstrate that non-parametric methods significantly outperform parametric ones in managing stateful learning. However, all models struggle with catastrophic forgetting as interactions lengthen, highlighting the need for further advances in lifelong learning.
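To make the cross-turn state-verification idea concrete, here is a minimal illustrative sketch, not the benchmark's actual implementation: ground-truth character state is recorded turn by turn, and a model's introspective answer is scored against the most recently established fact. All names (`StateRecord`, `EpisodeVerifier`, the example facts) are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class StateRecord:
    """Ground-truth facts established at one turn of an episode."""
    turn: int
    facts: dict  # e.g. {"relationship:Ophelia": "estranged"}

class EpisodeVerifier:
    """Tracks per-turn state and checks model answers against it."""

    def __init__(self):
        self.history: list[StateRecord] = []

    def record(self, turn: int, facts: dict) -> None:
        self.history.append(StateRecord(turn, facts))

    def latest_fact(self, key: str):
        # The most recent turn that set this fact is authoritative.
        for rec in reversed(self.history):
            if key in rec.facts:
                return rec.facts[key]
        return None

    def verify(self, key: str, model_answer: str) -> bool:
        """True if the model's answer reflects the current tracked state."""
        truth = self.latest_fact(key)
        return truth is not None and truth.lower() in model_answer.lower()

# Usage: a fact set early in the episode is overwritten later; a model that
# still reports the stale value is flagged as having forgotten the update.
v = EpisodeVerifier()
v.record(3, {"relationship:Ophelia": "devoted"})
v.record(7, {"relationship:Ophelia": "estranged"})
print(v.verify("relationship:Ophelia", "He is now estranged from her"))   # True
print(v.verify("relationship:Ophelia", "He remains devoted to Ophelia"))  # False
```

Under this framing, catastrophic forgetting shows up as verification failures that accumulate with episode length, which is what the benchmark's long-horizon results measure.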