AI Summary
This work addresses a key limitation of large language model-driven autonomous agents: the lack of persistent procedural memory, which prevents them from reusing solutions across structurally similar tasks. To overcome this, the authors propose the APEX-EM framework, which enables online learning through structured experience replay without modifying model weights. The approach introduces three key innovations: a structured experience representation, the Plan-Retrieve-Generate-Iterate-Ingest (PRGII) workflow, and a dual-outcome memory mechanism, which together enable cross-domain transfer even between tasks that share no lexical overlap but exhibit structural similarity. A hybrid retrieval strategy integrates semantic search, structural signature matching, and plan DAG traversal, augmented by a multidimensional reward validator and in-context learning with both positive and negative examples. Evaluated on the KGQAGen-10k, BigCodeBench, and HLE benchmarks, APEX-EM achieves gains of 48.3, 29.4, and 22.8 percentage points, respectively, substantially surpassing existing methods and even oracle-retrieval upper bounds.
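The dual-outcome memory described above can be sketched as a store that keeps both successful traces (as positive in-context examples) and annotated failures (as negative examples). This is a minimal illustrative sketch; all class and field names here are assumptions, not the paper's actual implementation.

```python
from dataclasses import dataclass
from enum import Enum


class Outcome(Enum):
    SUCCESS = "success"   # reused as a positive in-context example
    FAILURE = "failure"   # reused as a negative example with error notes


@dataclass
class ExperienceRecord:
    """One procedural-episodic trace of a task execution (illustrative fields)."""
    task: str                    # natural-language task description
    plan_steps: list[str]        # ordered planning steps
    artifacts: dict[str, str]    # named intermediate outputs
    iteration_errors: list[str]  # error analysis from failed iterations
    quality_score: float         # score from a reward validator
    outcome: Outcome


class ExperienceMemory:
    """Dual-outcome store: retains successes and annotated failures alike."""

    def __init__(self) -> None:
        self.records: list[ExperienceRecord] = []

    def ingest(self, rec: ExperienceRecord) -> None:
        self.records.append(rec)

    def positives(self) -> list[ExperienceRecord]:
        return [r for r in self.records if r.outcome is Outcome.SUCCESS]

    def negatives(self) -> list[ExperienceRecord]:
        return [r for r in self.records if r.outcome is Outcome.FAILURE]
```

Keeping failures alongside their structured error annotations is what lets the agent show the model what went wrong last time, not only what worked.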
Abstract
LLM-based autonomous agents lack persistent procedural memory: they re-derive solutions from scratch even when structurally identical tasks have been solved before. We present \textbf{APEX-EM}, a non-parametric online learning framework that accumulates, retrieves, and reuses structured procedural plans without modifying model weights. APEX-EM introduces: (1) a \emph{structured experience representation} encoding the full procedural-episodic trace of each execution -- planning steps, artifacts, iteration history with error analysis, and quality scores; (2) a \emph{Plan-Retrieve-Generate-Iterate-Ingest} (PRGII) workflow with Task Verifiers providing multi-dimensional reward signals; and (3) a \emph{dual-outcome Experience Memory} with hybrid retrieval combining semantic search, structural signature matching, and plan DAG traversal -- enabling cross-domain transfer between tasks sharing no lexical overlap but analogous operational structure. Successful experiences serve as positive in-context examples; failures as negative examples with structured error annotations.
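The hybrid retrieval idea (semantic search plus structural signature matching plus plan-DAG traversal) can be illustrated with a toy ranking function. This is a sketch under stated assumptions: token overlap stands in for embedding-based semantic search, the structural signature is abstracted as each plan step's leading verb, and the DAG-traversal step simply pulls in records linked to the best hit. None of these choices are taken from the paper.

```python
def token_overlap(a: str, b: str) -> float:
    """Stand-in for embedding similarity: Jaccard overlap over tokens."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / max(1, len(ta | tb))


def structural_signature(plan_steps: list[str]) -> tuple[str, ...]:
    """Abstract each plan step to its leading operation verb."""
    return tuple(step.split()[0].lower() for step in plan_steps)


def hybrid_score(task: str, plan: list[str], record: dict,
                 w_sem: float = 0.5, w_struct: float = 0.5) -> float:
    """Blend semantic similarity with an exact structural-signature match."""
    sem = token_overlap(task, record["task"])
    struct = 1.0 if structural_signature(plan) == structural_signature(record["plan"]) else 0.0
    return w_sem * sem + w_struct * struct


def retrieve(task: str, plan: list[str], records: list[dict],
             dag_edges: dict[str, list[str]], k: int = 2) -> list[dict]:
    """Rank records by hybrid score, then expand via plan-DAG neighbors of the top hit."""
    ranked = sorted(records, key=lambda r: hybrid_score(task, plan, r), reverse=True)
    top = ranked[:k]
    if top:
        for nid in dag_edges.get(top[0]["id"], []):
            top.extend(r for r in records if r["id"] == nid and r not in top)
    return top
```

The structural component is what makes cross-domain transfer possible: a record whose task shares almost no words with the query can still rank first if its plan follows the same operational skeleton.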
We evaluate on BigCodeBench~\cite{zhuo2025bigcodebench}, KGQAGen-10k~\cite{zhang2025kgqagen}, and Humanity's Last Exam~\cite{phan2025hle} using Claude Sonnet 4.5 and Opus 4.5. On KGQAGen-10k, APEX-EM achieves 89.6\% accuracy versus 41.3\% without memory (+48.3pp), surpassing the oracle-retrieval upper bound (84.9\%). On BigCodeBench, it reaches an 83.3\% success rate (SR) from a 53.9\% baseline (+29.4pp), exceeding MemRL's~\cite{memrl2025} +11.0pp gain under comparable frozen-backbone conditions (with backbone differences controlled for in our analysis). On HLE, entity graph retrieval reaches 48.0\% from 25.2\% (+22.8pp). Ablations show that component value is task-dependent: rich judge feedback is negligible for code generation but critical for structured queries (+10.3pp), while binary-signal iteration partially compensates for weaker feedback.