🤖 AI Summary
This study investigates “silent failures” in large language model agents, where a fluent output masks a failure to extract, retain, or retrieve critical information across sessions. Using causal tracing of internal feature circuits in the Qwen-3 model family (0.6B–14B) under the mem0 and A-MEM frameworks, the work finds that control (routing) circuitry is detectable at a smaller scale than content circuitry: routing is causally active at 0.6B, while content circuitry yields no detectable signal until 4B. Within the content group, Write and Read share a late-layer hub that acts as a context-grounding substrate and transfers across both frameworks. Detection and intervention also have distinct scale thresholds: the content circuit becomes detectable at 4B but reliably steerable only at 8B. Finally, the feature-space separation between the two circuit groups supports unsupervised, per-operation localization of memory failures at 76.2% accuracy. Together, the results suggest that small models can route with apparent competence yet silently fail at extraction and grounding.
📝 Abstract
Agent memory failures are silent: an LLM-based agent can produce a fluent response even when it fails to extract, retain, or retrieve the information needed across sessions. The write-manage-read loop describes the external pipeline of these systems but leaves open which internal computations implement each stage. Tracing internal feature circuits across the Qwen-3 family (0.6B–14B) and two memory frameworks (mem0 and A-MEM), we report three findings. First, control is detectable before content: routing circuitry is causally active at 0.6B, while content circuitry produces no detectable signal until 4B under our tracing setup, creating a deployment regime where small models route with apparent competence but silently fail at extraction and grounding. Second, within the content group, Write and Read share a late-layer hub that operates as a context-grounding substrate already present in the base model; only memory framing recruits a functional grounding direction on this substrate, and the hub transfers across both frameworks. Third, emergence does not imply steerability: although the content circuit becomes detectable at 4B, it becomes reliably steerable only at 8B, indicating that detection and intervention have distinct scale thresholds. As a practical implication, the feature-space separation between the two circuit groups enables per-operation failure localization at 76.2% accuracy without supervision, providing a stage-level diagnostic for otherwise silent agent-memory failures.