If an LLM Were a Character, Would It Know Its Own Story? Evaluating Lifelong Learning in LLMs

📅 2025-03-30
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing LLM evaluation benchmarks inadequately capture lifelong-learning capabilities, such as sustained role behavior, self-awareness, episodic memory, and relationship tracking, in multi-turn, multi-agent interactions. To address this gap, the paper introduces LIFESTATE-BENCH, a systematic benchmark for evaluating emergent lifelong learning in LLMs. Built on two episodic datasets, an adapted script of *Hamlet* and a synthetic script collection, it pairs rich narrative structure with dense character interactions, overcoming the limitations of static, single-turn assessment. The authors propose a fact-checking-style evaluation protocol that combines introspective question answering with cross-turn state verification, and compare parametric approaches (fine-tuning, prompting) against non-parametric ones (retrieval-augmented generation, external memory). Experiments show that non-parametric methods significantly outperform parametric ones; moreover, leading models, including Llama3.1-8B, GPT-4-turbo, and DeepSeek R1, exhibit catastrophic forgetting in long-horizon interactions, exposing a fundamental bottleneck in current lifelong-learning capabilities.
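The cross-turn state verification described above can be sketched in a few lines. The names here (`CharacterState`, `score_episode`) and the three probes are illustrative assumptions, not the paper's actual protocol: the idea is only that a model's introspective answers after each episode are checked against a ground-truth character state.

```python
# Hypothetical sketch of a fact-checking-style evaluation: after each episode,
# the model's introspective answers are verified against ground-truth state.
# All names and the scoring rule are illustrative, not from the paper.
from dataclasses import dataclass, field

@dataclass
class CharacterState:
    """Ground-truth state of a character after a given episode."""
    identity: str                                    # who the model is playing
    events: list = field(default_factory=list)       # episodic memory
    relations: dict = field(default_factory=dict)    # character -> stance

def score_episode(state: CharacterState, answers: dict) -> float:
    """Fraction of introspective probes the model answered correctly."""
    checks = [
        answers.get("identity") == state.identity,   # self-awareness
        answers.get("last_event") == (state.events[-1] if state.events else None),
        answers.get("relations") == state.relations,  # relationship tracking
    ]
    return sum(checks) / len(checks)

state = CharacterState(
    identity="Hamlet",
    events=["saw the ghost", "staged the play"],
    relations={"Claudius": "suspects", "Horatio": "trusts"},
)
# A model that kept its identity and last event but forgot its
# relationships scores 2 out of 3 probes.
answers = {"identity": "Hamlet", "last_event": "staged the play", "relations": {}}
print(score_episode(state, answers))
```

Averaging such per-episode scores over a long interaction is one simple way the catastrophic-forgetting trend reported in the experiments could be made visible.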

📝 Abstract
Large language models (LLMs) can carry out human-like dialogue, but unlike humans, they are stateless due to the superposition property. However, during multi-turn, multi-agent interactions, LLMs begin to exhibit consistent, character-like behaviors, hinting at a form of emergent lifelong learning. Despite this, existing benchmarks often fail to capture these dynamics, focusing primarily on static, open-ended evaluations. To address this gap, we introduce LIFESTATE-BENCH, a benchmark designed to assess lifelong learning in LLMs. It features two episodic datasets, Hamlet and a synthetic script collection, both rich in narrative structure and character interactions. Our fact-checking evaluation probes models' self-awareness, episodic memory retrieval, and relationship tracking across both parametric and non-parametric approaches. In experiments on models such as Llama3.1-8B, GPT-4-turbo, and DeepSeek R1, we demonstrate that non-parametric methods significantly outperform parametric ones in managing stateful learning. However, all models struggle with catastrophic forgetting as interactions extend, highlighting the need for further advances in lifelong learning.
Problem

Research questions and friction points this paper is trying to address.

Assessing lifelong learning in LLMs during multi-turn interactions
Evaluating self-awareness and memory retrieval in character-like LLMs
Addressing catastrophic forgetting in stateful learning approaches
Innovation

Methods, ideas, or system contributions that make the work stand out.

LIFESTATE-BENCH benchmark for lifelong learning
Non-parametric methods outperform parametric approaches
Evaluates self-awareness and episodic memory retrieval
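The non-parametric side of the comparison above can be illustrated with a minimal external-memory sketch: past episodes are stored verbatim and retrieved into the prompt at query time, so no model weights change. The class name, the keyword-overlap retriever, and the example episodes are all toy assumptions standing in for the paper's actual retrieval-augmented setup.

```python
# Toy sketch of a non-parametric (external memory) approach: episodes are
# written to a store and retrieved by relevance instead of fine-tuning the
# model. The keyword-overlap ranking is an illustrative stand-in only.
class EpisodicMemory:
    def __init__(self):
        self.episodes = []               # list of (episode_id, text)

    def write(self, episode_id: int, text: str) -> None:
        """Append one episode; nothing is ever folded into model weights."""
        self.episodes.append((episode_id, text))

    def retrieve(self, query: str, k: int = 2) -> list:
        """Return the k episodes sharing the most words with the query."""
        q = set(query.lower().split())
        ranked = sorted(
            self.episodes,
            key=lambda ep: len(q & set(ep[1].lower().split())),
            reverse=True,
        )
        return ranked[:k]

mem = EpisodicMemory()
mem.write(1, "Hamlet saw the ghost of his father on the battlements")
mem.write(2, "Ophelia returned the letters to Hamlet")
mem.write(3, "Hamlet staged a play to test Claudius")

# Retrieval grounds the next turn in past episodes without any training step.
hits = mem.retrieve("what did Hamlet learn from the ghost", k=1)
print(hits[0][0])
```

Because the memory is external, nothing is overwritten when new episodes arrive, which is one plausible reason such methods resist the catastrophic forgetting that parametric fine-tuning exhibits.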
Siqi Fan
University of Electronic Science and Technology of China, Chengdu, China
Xiusheng Huang
Beijing Academy of Artificial Intelligence, Beijing, China
Yiqun Yao
Unknown affiliation
Xuezhi Fang
Beijing Academy of Artificial Intelligence, Beijing, China
Kang Liu
Institute of Automation, Chinese Academy of Sciences, Beijing, China
Peng Han
Professor, Department of Computer Science, UESTC
drug discovery, spatial temporal, data mining
Shuo Shang
Computer Science & AI Scientist
Spatial data, spatiotemporal databases
Aixin Sun
Nanyang Technological University, Singapore
Yequan Wang
Beijing Academy of Artificial Intelligence, Beijing, China