Episodic-Semantic Memory Architecture for Long-Horizon Scientific Agents

📅 2026-05-17

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the cognitive degradation and escalating computational overhead in large language models during extended scientific collaboration, caused by context saturation. To overcome these limitations, we propose a dual-process memory architecture that decouples short-term episodic memory (fixed to the latest 10 messages) from long-term semantic knowledge (growing at approximately 3 tokens per message). The framework integrates domain-specific knowledge compression, dual-channel episodic-semantic memory, and cross-model verification, enabling robust handling of parameter contradictions, multi-hop reasoning across collaboration stages, and precise retention of technical facts. Evaluated across six mainstream large language models, our system maintains 70–85% accuracy over 15,000 messages with only 1–2 seconds of latency, reduces token consumption by 62%, and successfully manages over 14,000 scientific facts (125k tokens), substantially surpassing the capacity and efficiency limits of conventional full-context approaches.
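The decoupling described above can be illustrated with a minimal sketch. It assumes that facts arrive already extracted as key-value pairs and that a newer value simply overwrites an older one; the class name `DualProcessMemory`, its methods, and the `temperature_K` parameter are hypothetical and are not taken from the paper, whose consolidation step is more involved.

```python
from collections import deque


class DualProcessMemory:
    """Hypothetical sketch: bounded episodic window plus a growing fact store."""

    def __init__(self, window_size=10):
        # Episodic channel: only the latest `window_size` messages are kept
        # (the source fixes this window at 10 messages).
        self.episodic = deque(maxlen=window_size)
        # Semantic channel: consolidated facts, keyed by parameter name.
        self.semantic = {}

    def add_message(self, text, facts=None):
        # The deque silently drops the oldest message once it is full.
        self.episodic.append(text)
        # Assumed contradiction policy: the most recent value wins.
        for key, value in (facts or {}).items():
            self.semantic[key] = value

    def build_context(self):
        # Prompt context = compact fact list + recent messages.
        fact_lines = [f"{k} = {v}" for k, v in self.semantic.items()]
        return "\n".join(["[facts]"] + fact_lines + ["[recent]"] + list(self.episodic))


mem = DualProcessMemory()
for i in range(25):
    # Every fifth message revises the same parameter (illustrative only).
    update = {"temperature_K": 300 + i} if i % 5 == 0 else None
    mem.add_message(f"msg {i}", update)

print(len(mem.episodic))              # episodic size stays bounded
print(mem.semantic["temperature_K"])  # only the latest revision survives
```

The episodic side has constant size regardless of conversation length, while the semantic side grows only when new facts appear, which is the property the summary attributes to the architecture.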

📝 Abstract

As Large Language Models (LLMs) evolve into persistent scientific collaborators, context window saturation has emerged as a critical bottleneck. Scientific workflows involving iterative data analysis and hypothesis refinement rapidly saturate even extended contexts with dense technical content, while monolithic approaches suffer from quadratic cost scaling and cognitive degradation. We evaluate a Dual Process Memory Architecture that decouples immediate episodic needs (a constant 10-message window) from long-term consolidated knowledge (growing at approximately 3 tokens per message). Unlike prior social agent memory systems, our domain-specific consolidation addresses contradictory parameter evolution, multi-hop reasoning across experimental phases, and precise technical fact retention. Through large-scale evaluation spanning 15,000 messages, with cross-model validation across six LLMs from three families (OpenAI, Anthropic, Google) and 1,440 queries in total, we establish three key findings. First, while full-context models fail at 10,000 messages due to context overflow, our system maintains 70–85% accuracy with 1–2 second latency while using 62% fewer tokens (45,434 tokens versus the 120,000+ token limit). Second, cross-model validation reveals architecture-level trade-offs independent of specific LLMs: Dual Process excels at numeric and temporal queries (65–90% accuracy) while RAG excels at historical retrieval (60–85%), suggesting complementary deployment strategies. Third, we identify a "Sim-to-Real" gap in which synthetic tests maintain constant memory but realistic workflows exhibit linear growth (about 3 tokens per message), with consolidation quality emerging as the primary scalability bottleneck. The architecture successfully manages profiles with 14,000+ scientific facts (125k tokens), demonstrating that domain-specific memory consolidation enables sustained operation beyond full-context limits.
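The headline token figures can be cross-checked with simple arithmetic. The sketch below assumes that the 45,434-token figure refers to the full 15,000-message run and treats 120,000 tokens as the baseline; the abstract states neither explicitly (it gives the limit as "120,000+"), so the computed reduction is a lower bound under those assumptions. The function names are illustrative.

```python
def projected_semantic_tokens(messages, tokens_per_message):
    # Linear growth model taken from the abstract (about 3 tokens per message).
    return messages * tokens_per_message


def token_reduction(used_tokens, baseline_tokens):
    # Fraction of tokens saved relative to the baseline.
    return 1 - used_tokens / baseline_tokens


projected = projected_semantic_tokens(15_000, 3)
reduction = token_reduction(45_434, 120_000)

print(projected)            # close to the reported 45,434 tokens
print(round(reduction, 3))  # consistent with the reported 62% saving
```

Under these assumptions the linear-growth rate and the reported token count agree to within about 1%, and the 62% reduction follows directly from the two token figures.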

Problem

Research questions and friction points this paper is trying to address.

context window saturation

long-horizon scientific agents

memory consolidation

scientific workflows

large language models

Innovation

Methods, ideas, or system contributions that make the work stand out.

Episodic-Semantic Memory

Dual Process Architecture

Domain-specific Consolidation