Useful Memories Become Faulty When Continuously Updated by LLMs

📅 2026-05-13

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study examines how agentic-memory systems that have an LLM continuously consolidate past experience into a textual memory bank can introduce errors and degrade task performance. As consolidation proceeds, memory utility first rises, then degrades, and can fall below the no-memory baseline. Even when consolidating from ground-truth solutions, GPT-5.4 fails on 54% of a set of ARC-AGI problems it had previously solved without memory. The authors trace the regression to the consolidation step rather than to the underlying experience: an episodic-only control that simply retains the raw trajectories remains competitive with the consolidators tested. In a controlled ARC-AGI Stream environment that exposes Retain, Delete, and Consolidate actions, agents that preserve raw episodes by default double the accuracy of their forced-consolidation counterparts, and disabling consolidation entirely matches that result. The practical recommendation is to treat raw episodes as first-class evidence and to gate consolidation explicitly rather than running it after every interaction.

📝 Abstract

Learning from past experience benefits from two complementary forms of memory: episodic traces -- raw trajectories of what happened -- and consolidated abstractions distilled across many episodes into reusable, schema-like lessons. Recent agentic-memory systems pursue the consolidated form: an LLM rewrites past trajectories into a textual memory bank that it continuously updates with new interactions, promising self-improving agents without parameter updates. Yet we find that such consolidated memories produced by today's LLMs are often faulty even when derived from useful experiences. As consolidation proceeds, memory utility first rises, then degrades, and can fall below the no-memory baseline. More surprisingly, even when consolidating from ground-truth solutions, GPT-5.4 fails on 54% of a set of ARC-AGI problems it had previously solved without memory. We trace the regression to the consolidation step rather than the underlying experience: the same trajectories yield qualitatively different memories under different update schedules, and an episodic-only control that simply retains those trajectories remains competitive with the consolidators we test. In a controlled ARC-AGI Stream environment that exposes Retain, Delete, and Consolidate actions, agents preserve raw episodes by default and double the accuracy of their forced-consolidation counterparts; disabling consolidation entirely (episodic management only) matches this auto regime. Practically, robust agent memory should treat raw episodes as first-class evidence and gate consolidation explicitly rather than firing it after every interaction. Looking forward, reliable agentic memory will require LLMs that can consolidate without overwriting the evidence they depend on.
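The recommendation to keep raw episodes as first-class evidence and to gate consolidation can be illustrated with a minimal sketch. This is not the paper's implementation: the `MemoryBank` class, the `step` function, the `min_pending` threshold, and the toy `summarize` callable are all hypothetical names chosen for illustration, and the count-based gate is just one assumed policy. The one property the sketch tries to capture is that consolidation adds a lesson without overwriting the episodes it was derived from.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable, List


class MemoryAction(Enum):
    # The three actions named in the abstract.
    RETAIN = "retain"
    DELETE = "delete"
    CONSOLIDATE = "consolidate"


@dataclass
class MemoryBank:
    # Hypothetical container: raw trajectories plus distilled lessons.
    episodes: List[str] = field(default_factory=list)
    lessons: List[str] = field(default_factory=list)
    consolidated_upto: int = 0  # episodes already covered by some lesson

    def retain(self, episode: str) -> None:
        self.episodes.append(episode)

    def delete(self, index: int) -> None:
        del self.episodes[index]
        # Keep the cursor aligned when an already-consolidated episode goes away.
        if index < self.consolidated_upto:
            self.consolidated_upto -= 1

    def consolidate(self, summarize: Callable[[List[str]], str]) -> None:
        pending = self.episodes[self.consolidated_upto:]
        if not pending:
            return
        # Append a lesson; the raw episodes are deliberately left untouched.
        self.lessons.append(summarize(pending))
        self.consolidated_upto = len(self.episodes)


def step(bank: MemoryBank, episode: str,
         summarize: Callable[[List[str]], str],
         min_pending: int = 3) -> MemoryAction:
    """Retain by default; consolidate only when the (assumed) gate opens."""
    bank.retain(episode)
    pending = len(bank.episodes) - bank.consolidated_upto
    if pending >= min_pending:
        bank.consolidate(summarize)
        return MemoryAction.CONSOLIDATE
    return MemoryAction.RETAIN


# Toy usage: a stand-in summarizer instead of an LLM call.
bank = MemoryBank()
toy_summarize = lambda eps: f"lesson from {len(eps)} episodes"
actions = [step(bank, f"trajectory {i}", toy_summarize) for i in range(4)]
```

Setting `min_pending=1` in this sketch would correspond to consolidating after every interaction, while never calling `consolidate` corresponds to an episodic-only setup; in a real agent the gate would more plausibly be a model decision than a fixed count.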

Problem

Research questions and friction points this paper is trying to address.

consolidated memory

memory degradation

LLM-based agents

episodic memory

memory consolidation

Innovation

Methods, ideas, or system contributions that make the work stand out.

memory consolidation

episodic memory

LLM-based agents